path: root/arch/x86/kvm
Age | Commit message | Author | Files | Lines
2022-06-27 | KVM: VMX: Fix IBRS handling after vmexit | Josh Poimboeuf | 1 | -1/+6
For legacy IBRS to work, the IBRS bit needs to be always re-written after vmexit, even if it's already on. Signed-off-by: Josh Poimboeuf <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Signed-off-by: Borislav Petkov <[email protected]>
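A minimal sketch of the idea, with a hypothetical helper name (the real change lives in the VMX vmexit path); it only illustrates why the write must be unconditional:

static void vmx_restore_host_spec_ctrl_sketch(u64 host_spec_ctrl)
{
	/*
	 * With legacy IBRS, it is the act of writing IBRS=1 after the
	 * transition back to the host that re-arms the protection, so the
	 * usual "skip the WRMSR if the value is unchanged" optimization
	 * must not be applied here.
	 */
	native_wrmsrl(MSR_IA32_SPEC_CTRL, host_spec_ctrl);
}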
2022-06-27 | KVM: VMX: Prevent guest RSB poisoning attacks with eIBRS | Josh Poimboeuf | 4 | -31/+68
On eIBRS systems, the returns in the vmexit return path from __vmx_vcpu_run() to vmx_vcpu_run() are exposed to RSB poisoning attacks. Fix that by moving the post-vmexit spec_ctrl handling to immediately after the vmexit. Signed-off-by: Josh Poimboeuf <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Signed-off-by: Borislav Petkov <[email protected]>
2022-06-27 | KVM: VMX: Convert launched argument to flags | Josh Poimboeuf | 5 | -9/+31
Convert __vmx_vcpu_run()'s 'launched' argument to 'flags', in preparation for doing SPEC_CTRL handling immediately after vmexit, which will need another flag. This is much easier than adding a fourth argument, because this code supports both 32-bit and 64-bit, and the fourth argument on 32-bit would have to be pushed on the stack. Note that __vmx_vcpu_run_flags() is called outside of the noinstr critical section because it will soon start calling potentially traceable functions. Signed-off-by: Josh Poimboeuf <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Signed-off-by: Borislav Petkov <[email protected]>
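Roughly the shape of the conversion; the flag name and helper below are illustrative, not the exact code:

#define VMX_RUN_VMRESUME	(1 << 0)	/* replaces the old 'launched' bool */

static unsigned int vmx_vcpu_run_flags_sketch(struct vcpu_vmx *vmx)
{
	unsigned int flags = 0;

	/* Bit 0 tells the asm stub whether to VMRESUME or VMLAUNCH. */
	if (vmx->loaded_vmcs->launched)
		flags |= VMX_RUN_VMRESUME;

	/* A follow-up patch adds another flag for SPEC_CTRL handling. */
	return flags;
}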
2022-06-27 | KVM: VMX: Flatten __vmx_vcpu_run() | Josh Poimboeuf | 1 | -73/+46
Move the vmx_vm{enter,exit}() functionality into __vmx_vcpu_run(). This will make it easier to do the spec_ctrl handling before the first RET. Signed-off-by: Josh Poimboeuf <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Signed-off-by: Borislav Petkov <[email protected]>
2022-06-27 | x86: Add magic AMD return-thunk | Peter Zijlstra | 1 | -0/+18
Note: needs to be in a section distinct from Retpolines such that the Retpoline RET substitution cannot possibly use immediate jumps. ORC unwinding for zen_untrain_ret() and __x86_return_thunk() is a little tricky but works because zen_untrain_ret() doesn't have any stack ops and as such will emit a single ORC entry at the start (+0x3f). Meanwhile, unwinding any IP, including the __x86_return_thunk() one (+0x40), will search for the largest ORC entry smaller than or equal to that IP; this finds the single ORC entry at +0x3f and everything works. [ Alexandre: SVM part. ] [ bp: Build fix, massages. ] Suggested-by: Andrew Cooper <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Signed-off-by: Borislav Petkov <[email protected]> Reviewed-by: Josh Poimboeuf <[email protected]> Signed-off-by: Borislav Petkov <[email protected]>
2022-06-27 | x86/kvm: Fix SETcc emulation for return thunks | Peter Zijlstra | 1 | -13/+15
Prepare the SETcc fastop stuff for when RET can be larger still. The tricky bit here is that the expressions should not only be constant C expressions, but also absolute GAS expressions. This means no ?: and 'true' is ~0. Also ensure em_setcc() has the same alignment as the actual FOP_SETCC() ops; this ensures there cannot be an alignment hole between em_setcc() and the first op. Additionally, add a .skip directive to the FOP_SETCC() macro to fill any remaining space with INT3 traps; however, the primary purpose of this directive is to generate AS warnings when the remaining space goes negative, which is a very good indication the alignment magic went sideways. Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Signed-off-by: Borislav Petkov <[email protected]> Reviewed-by: Josh Poimboeuf <[email protected]> Signed-off-by: Borislav Petkov <[email protected]>
2022-06-27 | x86/kvm/vmx: Make noinstr clean | Peter Zijlstra | 2 | -5/+5
The recent mmio_stale_data fixes broke the noinstr constraints: vmlinux.o: warning: objtool: vmx_vcpu_enter_exit+0x15b: call to wrmsrl.constprop.0() leaves .noinstr.text section vmlinux.o: warning: objtool: vmx_vcpu_enter_exit+0x1bf: call to kvm_arch_has_assigned_device() leaves .noinstr.text section make it all happy again. Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Signed-off-by: Borislav Petkov <[email protected]>
2022-06-25 | KVM: x86/mmu: Buffer nested MMU split_desc_cache only by default capacity | Sean Christopherson | 1 | -7/+15
Buffer split_desc_cache, the cache used to allocate rmap list entries, only by the default cache capacity (currently 40), not by doubling the minimum (513). Aliasing L2 GPAs to L1 GPAs is uncommon, thus eager page splitting is unlikely to need 500+ entries. And because each object is a non-trivial 128 bytes (see struct pte_list_desc), those extra ~500 entries mean KVM is in all likelihood wasting ~64KiB of memory per VM. Link: https://lore.kernel.org/all/YrTDcrsn0%[email protected] Cc: David Matlack <[email protected]> Signed-off-by: Sean Christopherson <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
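A sketch of the intent, assuming the existing split_desc_cache field and the default capacity constant (the helper name is made up):

static int topup_split_desc_cache_sketch(struct kvm *kvm)
{
	/*
	 * The default capacity (KVM_ARCH_NR_OBJS_PER_MEMORY_CACHE, i.e. 40
	 * objects) is plenty in practice; aliasing L2 GPAs to one L1 GPA is
	 * rare, so don't pre-allocate the theoretical worst case of 513
	 * pte_list_desc entries.
	 */
	return kvm_mmu_topup_memory_cache(&kvm->arch.split_desc_cache,
					  KVM_ARCH_NR_OBJS_PER_MEMORY_CACHE);
}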
2022-06-25 | KVM: x86/mmu: Use "unsigned int", not "u32", for SPTEs' @access info | Sean Christopherson | 1 | -4/+6
Use an "unsigned int" for @access parameters instead of a "u32", mostly to be consistent throughout KVM, but also because "u32" is misleading. @access can actually squeeze into a u8, i.e. doesn't need 32 bits, but is as an "unsigned int" because sp->role.access is an unsigned int. No functional change intended. Link: https://lore.kernel.org/all/YqyZxEfxXLsHGoZ%[email protected] Cc: David Matlack <[email protected]> Signed-off-by: Sean Christopherson <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2022-06-24 | KVM: SEV-ES: reuse advance_sev_es_emulated_ins for OUT too | Paolo Bonzini | 1 | -10/+9
complete_emulator_pio_in() only has to be called by complete_sev_es_emulated_ins() now; therefore, all that the function does now is adjust sev_pio_count and sev_pio_data, which is the same for both IN and OUT. No functional change intended. Signed-off-by: Paolo Bonzini <[email protected]>
2022-06-24 | KVM: x86: de-underscorify __emulator_pio_in | Paolo Bonzini | 1 | -15/+9
Now all callers except emulator_pio_in_emulated are using __emulator_pio_in/complete_emulator_pio_in explicitly. Move the "either copy the result or attempt PIO" logic into emulator_pio_in_emulated, and rename __emulator_pio_in to just emulator_pio_in. Signed-off-by: Paolo Bonzini <[email protected]>
2022-06-24 | KVM: x86: wean fast IN from emulator_pio_in | Paolo Bonzini | 1 | -1/+1
Use __emulator_pio_in() directly for fast PIO instead of bouncing through emulator_pio_in() now that __emulator_pio_in() fills "val" when handling in-kernel PIO. vcpu->arch.pio.count is guaranteed to be '0', so this is a pure nop. emulator_pio_in_emulated is now the last caller of emulator_pio_in. No functional change intended. Signed-off-by: Paolo Bonzini <[email protected]>
2022-06-24 | KVM: x86: wean in-kernel PIO from vcpu->arch.pio* | Paolo Bonzini | 1 | -41/+32
Make emulator_pio_in_out operate directly on the provided buffer as long as PIO is handled inside KVM. For input operations, this means that, in the case of in-kernel PIO, __emulator_pio_in() does not have to be always followed by complete_emulator_pio_in(). This affects emulator_pio_in() and kvm_sev_es_ins(); for the latter, that is why the call moves from advance_sev_es_emulated_ins() to complete_sev_es_emulated_ins(). For output, it means that vcpu->pio.count is never set unnecessarily and there is no need to clear it; but also vcpu->pio.size must not be used in kvm_sev_es_outs(), because it will not be updated for in-kernel OUT. Signed-off-by: Paolo Bonzini <[email protected]>
2022-06-24 | KVM: x86: move all vcpu->arch.pio* setup in emulator_pio_in_out() | Paolo Bonzini | 2 | -9/+14
For now, this is basically an excuse to add back the void* argument to the function, while removing some knowledge of vcpu->arch.pio* from its callers. The WARN that vcpu->arch.pio.count is zero is also extended to OUT operations. The vcpu->arch.pio* fields still need to be filled even when the PIO is handled in-kernel, as __emulator_pio_in() is always followed by complete_emulator_pio_in(). But after fixing that, it will be possible to only populate the vcpu->arch.pio* fields on userspace exits. No functional change intended. Signed-off-by: Paolo Bonzini <[email protected]>
2022-06-24 | KVM: x86: drop PIO from unregistered devices | Paolo Bonzini | 1 | -3/+13
KVM protects the device list with SRCU, and therefore different calls to kvm_io_bus_read()/kvm_io_bus_write() can very well see different incarnations of kvm->buses. If userspace unregisters a device while vCPUs are running there is no well-defined result. This patch applies a safe fallback by returning early from emulator_pio_in_out(). This corresponds to returning zeroes from IN, and dropping the writes on the floor for OUT. Signed-off-by: Paolo Bonzini <[email protected]>
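A simplified sketch of the fallback (single transfer only, hypothetical helper name; the real code loops over 'count' and then continues with the normal PIO handling):

static bool pio_device_missing_sketch(struct kvm_vcpu *vcpu, int size,
				      unsigned short port, void *data, bool in)
{
	int r;

	if (in)
		r = kvm_io_bus_read(vcpu, KVM_PIO_BUS, port, size, data);
	else
		r = kvm_io_bus_write(vcpu, KVM_PIO_BUS, port, size, data);

	if (r) {
		/* No registered device: IN reads zeroes, OUT is dropped. */
		if (in)
			memset(data, 0, size);
		return true;
	}
	return false;
}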
2022-06-24 | KVM: x86: inline kernel_pio into its sole caller | Paolo Bonzini | 1 | -22/+16
The caller of kernel_pio already has arguments for most of what kernel_pio fishes out of vcpu->arch.pio. This is the first step towards ensuring that vcpu->arch.pio.* is only used when exiting to userspace. No functional change intended. Signed-off-by: Paolo Bonzini <[email protected]>
2022-06-24 | KVM: x86: complete fast IN directly with complete_emulator_pio_in() | Paolo Bonzini | 1 | -5/+1
Use complete_emulator_pio_in() directly when completing fast PIO, there's no need to bounce through emulator_pio_in(): the comment about ECX changing doesn't apply to fast PIO, which isn't used for string I/O. No functional change intended. Signed-off-by: Paolo Bonzini <[email protected]>
2022-06-24 | KVM: x86: nSVM: optimize svm_set_x2apic_msr_interception | Maxim Levitsky | 3 | -0/+17
- Avoid toggling the x2apic msr interception if it is already up to date. - Avoid touching L0 msr bitmap when AVIC is inhibited on entry to the guest mode, because in this case the guest usually uses its own msr bitmap. Later on VM exit, the 1st optimization will allow KVM to skip touching the L0 msr bitmap as well. Reviewed-by: Suravee Suthikulpanit <[email protected]> Tested-by: Suravee Suthikulpanit <[email protected]> Signed-off-by: Maxim Levitsky <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2022-06-24 | KVM: SVM: Add AVIC doorbell tracepoint | Suravee Suthikulpanit | 3 | -1/+22
Add a tracepoint to track number of doorbells being sent to signal a running vCPU to process IRQ after being injected. Reviewed-by: Maxim Levitsky <[email protected]> Signed-off-by: Suravee Suthikulpanit <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2022-06-24 | KVM: SVM: Use target APIC ID to complete x2AVIC IRQs when possible | Suravee Suthikulpanit | 1 | -3/+18
For x2AVIC, the index from incomplete IPI #vmexit info is invalid for logical cluster mode. Only the ICRH/ICRL values can be used to determine the IPI destination APIC ID. Since QEMU defines the guest physical APIC ID to be the same as the vCPU ID, it can be used to quickly identify the target vCPU to deliver the IPI to, avoiding the overhead of searching through all vCPUs to match the target. Reviewed-by: Maxim Levitsky <[email protected]> Signed-off-by: Suravee Suthikulpanit <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
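A sketch of the fast path (hypothetical helper; relies on the QEMU convention that the guest physical APIC ID equals the vCPU ID):

static struct kvm_vcpu *x2avic_ipi_target_sketch(struct kvm *kvm, u32 icrh)
{
	/* In x2APIC mode ICRH holds the full 32-bit destination APIC ID. */
	return kvm_get_vcpu_by_id(kvm, icrh);
}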
2022-06-24 | KVM: x86: Warn about APICv inconsistency only when vcpu APIC mode is valid | Suravee Suthikulpanit | 1 | -1/+2
When launching a VM with x2APIC and specifying more than 255 vCPUs, the guest kernel can disable x2APIC (e.g. via the nox2apic kernel option). The VM then falls back to xAPIC mode and disables vCPU IDs 255 and greater. In this case, APICv is deactivated for the disabled vCPUs. However, the current APICv consistency warning does not account for this case, which results in a spurious warning. Therefore, modify the warning logic to report only when the vCPU APIC mode is valid. Reviewed-by: Maxim Levitsky <[email protected]> Signed-off-by: Suravee Suthikulpanit <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2022-06-24 | KVM: SVM: Introduce hybrid-AVIC mode | Suravee Suthikulpanit | 2 | -11/+11
Currently, AVIC is inhibited when booting a VM w/ x2APIC support, because AVIC cannot virtualize x2APIC MSR accesses. However, the AVIC doorbell can be used to accelerate interrupt injection into a running vCPU, while all guest accesses to x2APIC MSRs will be intercepted and emulated by KVM. With hybrid-AVIC support, the APICV_INHIBIT_REASON_X2APIC is no longer enforced. Suggested-by: Maxim Levitsky <[email protected]> Reviewed-by: Maxim Levitsky <[email protected]> Signed-off-by: Suravee Suthikulpanit <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2022-06-24 | KVM: SVM: Do not throw warning when calling avic_vcpu_load on a running vcpu | Suravee Suthikulpanit | 1 | -1/+0
Originally, this WARN_ON was designed to detect calling avic_vcpu_load() on an already running vCPU in AVIC mode (i.e. with the AVIC is_running bit set). However, for x2AVIC, the vCPU can switch from xAPIC to x2APIC mode while in the running state, in which case avic_vcpu_load() will be called from svm_refresh_apicv_exec_ctrl(). Therefore, remove this warning since it is no longer appropriate. Reviewed-by: Maxim Levitsky <[email protected]> Reviewed-by: Pankaj Gupta <[email protected]> Signed-off-by: Suravee Suthikulpanit <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2022-06-24 | KVM: SVM: Introduce logic to (de)activate x2AVIC mode | Suravee Suthikulpanit | 3 | -5/+53
Introduce logic to (de)activate AVIC, which also allows switching between AVIC and x2AVIC modes at runtime. When an AVIC-enabled guest switches from APIC to x2APIC mode, the SVM driver needs to perform the following steps: 1. Set the x2APIC mode bit for AVIC in the VMCB along with the maximum APIC ID supported for each mode accordingly. 2. Disable x2APIC MSR interception in order to allow the hardware to virtualize x2APIC MSR accesses. Reported-by: kernel test robot <[email protected]> Reviewed-by: Maxim Levitsky <[email protected]> Signed-off-by: Suravee Suthikulpanit <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
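A sketch of the two steps; the bit, field, and helper names below are illustrative rather than the exact implementation:

static void avic_set_x2apic_mode_sketch(struct vcpu_svm *svm, bool x2avic)
{
	struct vmcb_control_area *ctl = &svm->vmcb->control;

	/*
	 * Step 1: flip the x2APIC mode bit (the maximum APIC ID field in
	 * the VMCB would be updated here as well).
	 */
	if (x2avic)
		ctl->int_ctl |= X2APIC_MODE_MASK;
	else
		ctl->int_ctl &= ~X2APIC_MODE_MASK;

	/* Step 2: only let hardware handle x2APIC MSRs in x2AVIC mode. */
	svm_set_x2apic_msr_interception(svm, !x2avic);
}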
2022-06-24 | KVM: x86: nSVM: always intercept x2apic msrs | Maxim Levitsky | 2 | -0/+14
As a preparation for x2avic, this patch ensures that x2apic msrs are always intercepted for the nested guest. Reviewed-by: Suravee Suthikulpanit <[email protected]> Tested-by: Suravee Suthikulpanit <[email protected]> Signed-off-by: Maxim Levitsky <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2022-06-24 | KVM: SVM: Refresh AVIC configuration when changing APIC mode | Suravee Suthikulpanit | 3 | -0/+15
AMD AVIC can support xAPIC and x2APIC virtualization, which requires changing the x2APIC bit in the VMCB and the MSR interception for x2APIC MSRs. Therefore, call avic_refresh_apicv_exec_ctrl() to refresh the configuration accordingly. Reviewed-by: Maxim Levitsky <[email protected]> Signed-off-by: Suravee Suthikulpanit <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2022-06-24 | KVM: x86: Deactivate APICv on vCPU with APIC disabled | Suravee Suthikulpanit | 2 | -2/+6
APICv should be deactivated on a vCPU that has its APIC disabled. Therefore, call kvm_vcpu_update_apicv() when changing APIC mode, and add an additional check for APIC-disabled mode when determining APICv activation. Suggested-by: Maxim Levitsky <[email protected]> Reviewed-by: Maxim Levitsky <[email protected]> Signed-off-by: Suravee Suthikulpanit <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2022-06-24 | KVM: SVM: Add support for configuring x2APIC MSR interception | Suravee Suthikulpanit | 2 | -2/+27
When enabling x2APIC virtualization (x2AVIC), the interception of x2APIC MSRs must be disabled to let the hardware virtualize guest MSR accesses. The current implementation keeps track of MSR interception state in the svm_direct_access_msrs array. Therefore, extend the array to include the x2APIC MSRs. Reviewed-by: Maxim Levitsky <[email protected]> Signed-off-by: Suravee Suthikulpanit <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2022-06-24 | KVM: SVM: Do not support updating APIC ID when in x2APIC mode | Suravee Suthikulpanit | 1 | -1/+10
In X2APIC mode, the Logical Destination Register is read-only, which provides a fixed mapping between the logical and physical APIC IDs. Therefore, there is no Logical APIC ID table in X2AVIC and the processor uses the X2APIC ID in the backing page to create a vCPU’s logical ID. In addition, KVM does not support updating APIC ID in x2APIC mode, which means AVIC does not need to handle this case. Therefore, check x2APIC mode when handling physical and logical APIC ID update, and when invalidating logical APIC ID table. Reviewed-by: Maxim Levitsky <[email protected]> Suggested-by: Maxim Levitsky <[email protected]> Signed-off-by: Suravee Suthikulpanit <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
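A sketch of the early-out (hypothetical function name):

static void avic_handle_ldr_update_sketch(struct kvm_vcpu *vcpu)
{
	/*
	 * In x2APIC mode the LDR is read-only and fully determined by the
	 * x2APIC ID, so there is no logical APIC ID table to keep in sync.
	 */
	if (apic_x2apic_mode(vcpu->arch.apic))
		return;

	/* xAPIC only: update the logical APIC ID table here. */
}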
2022-06-24 | KVM: SVM: Update avic_kick_target_vcpus to support 32-bit APIC ID | Suravee Suthikulpanit | 1 | -2/+8
In x2APIC mode, ICRH contains 32-bit destination APIC ID. So, update the avic_kick_target_vcpus() accordingly. Reviewed-by: Maxim Levitsky <[email protected]> Signed-off-by: Suravee Suthikulpanit <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2022-06-24 | KVM: SVM: Update max number of vCPUs supported for x2AVIC mode | Suravee Suthikulpanit | 1 | -3/+5
xAVIC and x2AVIC modes can support different numbers of vCPUs. Update the existing logic to support each mode accordingly. Also, modify the maximum physical APIC ID for AVIC to 255 to reflect the actual value supported by the architecture. Reviewed-by: Maxim Levitsky <[email protected]> Reviewed-by: Pankaj Gupta <[email protected]> Signed-off-by: Suravee Suthikulpanit <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2022-06-24 | KVM: SVM: Detect X2APIC virtualization (x2AVIC) support | Suravee Suthikulpanit | 3 | -13/+56
Add a CPUID check for the x2APIC virtualization (x2AVIC) feature. If available, the SVM driver can support both AVIC and x2AVIC modes when the kvm_amd driver is loaded with avic=1. The operating mode will be determined at runtime depending on the guest APIC mode. Reviewed-by: Maxim Levitsky <[email protected]> Reviewed-by: Pankaj Gupta <[email protected]> Signed-off-by: Suravee Suthikulpanit <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2022-06-24 | KVM: x86: lapic: Rename [GET/SET]_APIC_DEST_FIELD to [GET/SET]_XAPIC_DEST_FIELD | Suravee Suthikulpanit | 2 | -3/+3
To signify that the macros only support 8-bit xAPIC destination ID. Suggested-by: Maxim Levitsky <[email protected]> Reviewed-by: Maxim Levitsky <[email protected]> Reviewed-by: Pankaj Gupta <[email protected]> Signed-off-by: Suravee Suthikulpanit <[email protected]> Reviewed-by: Paolo Bonzini <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2022-06-24 | KVM: nVMX: clean up posted interrupt descriptor try_cmpxchg | Paolo Bonzini | 1 | -7/+8
Rely on try_cmpxchg64 for re-reading the PID on failure, using READ_ONCE only right before the first iteration. Signed-off-by: Paolo Bonzini <[email protected]>
2022-06-24 | KVM: x86: Enable CMCI capability by default and handle injected UCNA errors | Jue Wang | 2 | -1/+43
This patch enables MCG_CMCI_P by default in kvm_mce_cap_supported. It reuses the ioctl KVM_X86_SET_MCE to implement injection of UnCorrectable No Action required (UCNA) errors, signaled via Corrected Machine Check Interrupt (CMCI). Neither the CMCI nor the UCNA emulation depends on hardware. Signed-off-by: Jue Wang <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2022-06-24 | KVM: x86: Add emulation for MSR_IA32_MCx_CTL2 MSRs. | Jue Wang | 1 | -39/+91
This patch adds the emulation of IA32_MCi_CTL2 registers to KVM. A separate mci_ctl2_banks array is used to keep the existing mce_banks register layout intact. In Machine Check Architecture, in addition to MCG_CMCI_P, bit 30 of the per-bank register IA32_MCi_CTL2 controls whether Corrected Machine Check error reporting is enabled. Signed-off-by: Jue Wang <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
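A simplified sketch of the write side (hypothetical helper, with the checks reduced to the CMCI-enable bit):

static int set_mci_ctl2_sketch(struct kvm_vcpu *vcpu, u32 msr, u64 data)
{
	u32 bank = msr - MSR_IA32_MC0_CTL2;

	/* CMCI enable (bit 30) only makes sense when MCG_CMCI_P is set. */
	if (!(vcpu->arch.mcg_cap & MCG_CMCI_P) && (data & MCI_CTL2_CMCI_EN))
		return 1;

	vcpu->arch.mci_ctl2_banks[bank] = data;
	return 0;
}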
2022-06-24 | KVM: x86: Use kcalloc to allocate the mce_banks array. | Jue Wang | 1 | -1/+1
This patch updates the allocation of mce_banks with the array allocation API (kcalloc) as a precedent for the later mci_ctl2_banks to implement per-bank control of Corrected Machine Check Interrupt (CMCI). Suggested-by: Sean Christopherson <[email protected]> Signed-off-by: Jue Wang <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2022-06-24 | KVM: x86: Add Corrected Machine Check Interrupt (CMCI) emulation to lapic. | Jue Wang | 3 | -16/+39
This patch calculates the number of LVT entries as part of KVM_X86_MCE_SETUP, conditioned on the presence of the MCG_CMCI_P bit in MCG_CAP, and stores the result in kvm_lapic. It translates from the APIC_LVTx register to an index in the lapic_lvt_entry enum. It extends the APIC_LVTx macro as well as other lapic write/reset handling to support the Corrected Machine Check Interrupt. Signed-off-by: Jue Wang <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2022-06-24 | KVM: x86: Add APIC_LVTx() macro. | Jue Wang | 2 | -4/+5
An APIC_LVTx macro is introduced to calculate the APIC_LVTx register offset based on the index in the lapic_lvt_entry enum. Later patches will extend the APIC_LVTx macro to support the APIC_LVTCMCI register in order to implement Corrected Machine Check Interrupt signaling. Suggested-by: Sean Christopherson <[email protected]> Signed-off-by: Jue Wang <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
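The idea, as a sketch (the eventual macro gains a special case for APIC_LVTCMCI; the architectural LVT registers sit 0x10 apart starting at APIC_LVTT):

#define APIC_LVTx(x)	(APIC_LVTT + 0x10 * (x))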
2022-06-24 | KVM: x86/mmu: Avoid unnecessary flush on eager page split | Paolo Bonzini | 1 | -12/+20
The TLB flush before installing the newly-populated lower level page table is unnecessary if the lower-level page table maps the huge page identically. KVM knows this is the case if it did not reuse an existing shadow page table; tell drop_large_spte() to skip the flush in that case. Extracted from a patch by David Matlack. Signed-off-by: Paolo Bonzini <[email protected]>
2022-06-24 | KVM: x86: Fill apic_lvt_mask with enums / explicit entries. | Jue Wang | 2 | -10/+21
This patch defines a lapic_lvt_entry enum used as explicit indices to the apic_lvt_mask array. In later patches an LVT_CMCI entry will be added to implement Corrected Machine Check Interrupt signaling. Suggested-by: Sean Christopherson <[email protected]> Signed-off-by: Jue Wang <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2022-06-24 | KVM: x86: Make APIC_VERSION capture only the magic 0x14UL. | Jue Wang | 1 | -2/+2
Refactor APIC_VERSION so that the maximum number of LVT entries is inserted at runtime rather than compile time. This will be used in a subsequent commit to expose the LVT CMCI Register to VMs that support Corrected Machine Check error counting/signaling (IA32_MCG_CAP.MCG_CMCI_P=1). Suggested-by: Sean Christopherson <[email protected]> Signed-off-by: Jue Wang <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
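A sketch of the runtime insertion (the helper and macro names are made up); bits 23:16 of the version register report "max LVT entry", i.e. the number of LVT entries minus one:

#define APIC_VERSION_SKETCH	0x14UL

static void apic_set_version_sketch(struct kvm_lapic *apic, int nr_lvt_entries)
{
	kvm_lapic_set_reg(apic, APIC_LVR,
			  APIC_VERSION_SKETCH | ((nr_lvt_entries - 1) << 16));
}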
2022-06-24 | KVM: x86/mmu: Extend Eager Page Splitting to nested MMUs | David Matlack | 1 | -7/+252
Add support for Eager Page Splitting pages that are mapped by nested MMUs. Walk through the rmap first splitting all 1GiB pages to 2MiB pages, and then splitting all 2MiB pages to 4KiB pages. Note, Eager Page Splitting is limited to nested MMUs as a policy rather than due to any technical reason (the sp->role.guest_mode check could just be deleted and Eager Page Splitting would work correctly for all shadow MMU pages). There is really no reason to support Eager Page Splitting for tdp_mmu=N, since such support will eventually be phased out, and there is no current use case supporting Eager Page Splitting on hosts where TDP is either disabled or unavailable in hardware. Furthermore, future improvements to nested MMU scalability may diverge the code from the legacy shadow paging implementation. These improvements will be simpler to make if Eager Page Splitting does not have to worry about legacy shadow paging. Splitting huge pages mapped by nested MMUs requires dealing with some extra complexity beyond that of the TDP MMU: (1) The shadow MMU has a limit on the number of shadow pages that are allowed to be allocated. So, as a policy, Eager Page Splitting refuses to split if there are KVM_MIN_FREE_MMU_PAGES or fewer pages available. (2) Splitting a huge page may end up re-using an existing lower level shadow page table. This is unlike the TDP MMU which always allocates new shadow page tables when splitting. (3) When installing the lower level SPTEs, they must be added to the rmap which may require allocating additional pte_list_desc structs. Case (2) is especially interesting since it may require a TLB flush, unlike the TDP MMU which can fully split huge pages without any TLB flushes. Specifically, an existing lower level page table may point to even lower level page tables that are not fully populated, effectively unmapping a portion of the huge page, which requires a flush. As of this commit, a flush is always done after dropping the huge page and before installing the lower level page table. This TLB flush could instead be delayed until the MMU lock is about to be dropped, which would batch flushes for multiple splits. However these flushes should be rare in practice (a huge page must be aliased in multiple SPTEs and have been split for NX Huge Pages in only some of them). Flushing immediately is simpler to plumb and also reduces the chances of tripping over a CPU bug (e.g. see iTLB multihit). [ This commit is based off of the original implementation of Eager Page Splitting from Peter in Google's kernel from 2016. ] Suggested-by: Peter Feiner <[email protected]> Signed-off-by: David Matlack <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2022-06-24 | KVM: x86/mmu: pull call to drop_large_spte() into __link_shadow_page() | Paolo Bonzini | 2 | -39/+35
Before allocating a child shadow page table, all callers check whether the parent already points to a huge page and, if so, they drop that SPTE. This is done by drop_large_spte(). However, dropping the large SPTE is really only necessary before the sp is installed. While the sp is returned by kvm_mmu_get_child_sp(), installing it happens later in __link_shadow_page(). Move the call there instead of having it in each and every caller. To ensure that the shadow page is not linked twice if it was present, do _not_ opportunistically make kvm_mmu_get_child_sp() idempotent: instead, return an error value if the shadow page already existed. This is a bit more verbose, but clearer than NULL. Finally, now that the drop_large_spte() name is not taken anymore, remove the two underscores in front of __drop_large_spte(). Reviewed-by: Sean Christopherson <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2022-06-24 | KVM: x86/mmu: Zap collapsible SPTEs in shadow MMU at all possible levels | David Matlack | 1 | -7/+13
Currently KVM only zaps collapsible 4KiB SPTEs in the shadow MMU. This is fine for now since KVM never creates intermediate huge pages during dirty logging. In other words, KVM always replaces 1GiB pages directly with 4KiB pages, so there is no reason to look for collapsible 2MiB pages. However, this will stop being true once the shadow MMU participates in eager page splitting. During eager page splitting, each 1GiB page is first split into 2MiB pages and then those are split into 4KiB pages. The intermediate 2MiB pages may be left behind if an error condition causes eager page splitting to bail early. No functional change intended. Reviewed-by: Peter Xu <[email protected]> Signed-off-by: David Matlack <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
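Roughly what the change amounts to inside mmu.c, as a sketch using the existing rmap walker (the wrapper name is made up):

static void zap_collapsible_sptes_sketch(struct kvm *kvm,
					 const struct kvm_memory_slot *slot)
{
	/*
	 * Walk rmaps from 4KiB up to the level just below the largest huge
	 * page size, so intermediate 2MiB SPTEs left behind by an aborted
	 * eager split are collapsed as well.
	 */
	if (slot_handle_level(kvm, slot, kvm_mmu_zap_collapsible_spte,
			      PG_LEVEL_4K, KVM_MAX_HUGEPAGE_LEVEL - 1, true))
		kvm_arch_flush_remote_tlbs_memslot(kvm, slot);
}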
2022-06-24 | KVM: x86/mmu: Extend make_huge_page_split_spte() for the shadow MMU | David Matlack | 3 | -11/+10
Currently make_huge_page_split_spte() assumes execute permissions can be granted to any 4K SPTE when splitting huge pages. This is true for the TDP MMU but is not necessarily true for the shadow MMU, since KVM may be shadowing a non-executable huge page. To fix this, pass in the role of the child shadow page where the huge page will be split and derive the execution permission from that. This is correct because huge pages are always split with a direct shadow page, and thus the shadow page role contains the correct access permissions. No functional change intended. Signed-off-by: David Matlack <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2022-06-24 | KVM: x86/mmu: Cache the access bits of shadowed translations | David Matlack | 3 | -28/+83
Splitting huge pages requires allocating/finding shadow pages to replace the huge page. Shadow pages are keyed, in part, off the guest access permissions they are shadowing. For fully direct MMUs, there is no shadowing so the access bits in the shadow page role are always ACC_ALL. But during shadow paging, the guest can enforce whatever access permissions it wants. In particular, eager page splitting needs to know the permissions to use for the subpages, but KVM cannot retrieve them from the guest page tables because eager page splitting does not have a vCPU. Fortunately, the guest access permissions are easy to cache whenever page faults or FNAME(sync_page) update the shadow page tables; this is an extension of the existing cache of the shadowed GFNs in the gfns array of the shadow page. The access bits only take up 3 bits, which leaves 61 bits left over for gfns, which is more than enough. Now that the gfns array caches more information than just GFNs, rename it to shadowed_translation. While here, preemptively fix up the WARN_ON() that detects gfn mismatches in direct SPs. The WARN_ON() was paired with a pr_err_ratelimited(), which means that users could sometimes see the WARN without the accompanying error message. Fix this by outputting the error message as part of the WARN splat, and opportunistically make them WARN_ONCE() because if these ever fire, they are all but guaranteed to fire a lot and will bring down the kernel. Signed-off-by: David Matlack <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
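A sketch of the packing idea (helper names and the exact layout are illustrative): each entry stores the shadowed GFN plus the 3-bit ACC_* mask:

static u64 pack_shadowed_translation_sketch(gfn_t gfn, unsigned int access)
{
	return ((u64)gfn << 3) | (access & ACC_ALL);
}

static unsigned int shadowed_access_sketch(u64 entry)
{
	return entry & ACC_ALL;
}

static gfn_t shadowed_gfn_sketch(u64 entry)
{
	return entry >> 3;
}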
2022-06-24 | KVM: x86/mmu: Update page stats in __rmap_add() | David Matlack | 1 | -1/+2
Update the page stats in __rmap_add() rather than at the call site. This will avoid having to manually update page stats when splitting huge pages in a subsequent commit. No functional change intended. Reviewed-by: Ben Gardon <[email protected]> Reviewed-by: Peter Xu <[email protected]> Signed-off-by: David Matlack <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2022-06-24 | KVM: x86/mmu: Decouple rmap_add() and link_shadow_page() from kvm_vcpu | David Matlack | 1 | -18/+29
Allow adding new entries to the rmap and linking shadow pages without a struct kvm_vcpu pointer by moving the implementation of rmap_add() and link_shadow_page() into inner helper functions. No functional change intended. Reviewed-by: Ben Gardon <[email protected]> Reviewed-by: Peter Xu <[email protected]> Signed-off-by: David Matlack <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2022-06-24 | KVM: x86/mmu: Pass const memslot to rmap_add() | David Matlack | 1 | -1/+1
Constify rmap_add()'s @slot parameter; it is simply passed on to gfn_to_rmap(), which takes a const memslot. No functional change intended. Reviewed-by: Ben Gardon <[email protected]> Signed-off-by: David Matlack <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>