path: root/arch/x86
Age | Commit message | Author | Files | Lines
2024-06-20KVM: x86/tdp_mmu: Rename REMOVED_SPTE to FROZEN_SPTERick Edgecombe4-28/+28
Rename REMOVED_SPTE to FROZEN_SPTE so that it can be used for other multi-part operations. REMOVED_SPTE is used as a non-present intermediate value for multi-part operations that can happen when a thread doesn't have an MMU write lock. Today these operations happen when removing PTEs. However, future changes will want to use the same concept for setting a PTE. In that case the REMOVED_SPTE name does not quite fit. So rename it to FROZEN_SPTE so it can be used for both types of operations. Also rename the relevant helpers and comments that refer to "removed" within the context of the SPTE value. Take care not to update naming referring to the "remove" operations, which are still distinct. Suggested-by: Paolo Bonzini <[email protected]> Signed-off-by: Rick Edgecombe <[email protected]> Message-ID: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
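A minimal sketch of the multi-part pattern that the FROZEN_SPTE value serves, assuming a cmpxchg-based lockless update as described above (illustrative kernel-style pseudo-code, not the actual TDP MMU implementation):

    /* Freeze: succeeds only if nobody else changed the SPTE meanwhile. */
    if (!try_cmpxchg64(sptep, &old_spte, FROZEN_SPTE))
            return -EBUSY;                  /* lost the race; the caller retries */

    /* ... do the multi-step work (TLB flush, accounting) while frozen ... */

    WRITE_ONCE(*sptep, new_spte);           /* unfreeze with the final value */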
2024-06-20Merge branch 'kvm-6.10-fixes' into HEADPaolo Bonzini2-8/+4
2024-06-20KVM: x86/tdp_mmu: Sprinkle __must_checkIsaku Yamahata1-6/+7
The TDP MMU function __tdp_mmu_set_spte_atomic uses a cmpxchg64 to replace the SPTE value and returns -EBUSY on failure. The caller must check the return value and retry. Add __must_check to it, as well as to two more functions that forward the return value of __tdp_mmu_set_spte_atomic to their caller. Signed-off-by: Isaku Yamahata <[email protected]> Reviewed-by: Binbin Wu <[email protected]> Message-Id: <8f7d5a1b241bf5351eaab828d1a1efe5c17699ca.1705965635.git.isaku.yamahata@intel.com> Acked-by: Kai Huang <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2024-06-20x86/cpufeatures: Flip the /proc/cpuinfo appearance logicBorislav Petkov (AMD)3-457/+456
I'm getting tired of telling people to put a magic "" in the #define X86_FEATURE /* "" ... */ comment to hide the new feature flag from the user-visible /proc/cpuinfo. Flip the logic to make it explicit: an explicit "<name>" in the comment adds the flag to /proc/cpuinfo and otherwise not, by default. Add the "<name>" of all the existing flags to keep backwards compatibility with userspace. There should be no functional changes resulting from this. Acked-by: Dave Hansen <[email protected]> Signed-off-by: Borislav Petkov (AMD) <[email protected]> Link: https://lore.kernel.org/r/[email protected]
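An illustration of the flipped convention (the feature names below are placeholders, not real flags):

    /* Old scheme: a magic "" hides the flag from /proc/cpuinfo */
    #define X86_FEATURE_FOO   (18*32+ 4) /* "" Some hidden feature */

    /* New scheme: only an explicit "<name>" makes the flag visible */
    #define X86_FEATURE_FOO   (18*32+ 4) /* Hidden from /proc/cpuinfo */
    #define X86_FEATURE_BAR   (18*32+ 5) /* "bar" Shown as "bar" in /proc/cpuinfo */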
2024-06-20randomize_kstack: Remove non-functional per-arch entropy filteringKees Cook1-9/+6
An unintended consequence of commit 9c573cd31343 ("randomize_kstack: Improve entropy diffusion") was that the per-architecture entropy size filtering reduced how many bits were being added to the mix, rather than how many bits were being used during the offsetting. All architectures fell back to the existing default of 0x3FF (10 bits), which will consume at most 1KiB of stack space. It seems that this is working just fine, so let's avoid the confusion and update everything to use the default. The prior intent of the per-architecture limits was:
  arm64:   capped at 0x1FF (9 bits), 5 bits effective
  powerpc: uncapped (10 bits), 6 or 7 bits effective
  riscv:   uncapped (10 bits), 6 bits effective
  x86:     capped at 0xFF (8 bits), 5 (x86_64) or 6 (ia32) bits effective
  s390:    capped at 0xFF (8 bits), undocumented effective entropy
Current discussion has led to just dropping the original per-architecture filters. The additional entropy appears to be safe for arm64, x86, and s390. Quoting Arnd, "There is no point pretending that 15.75KB is somehow safe to use while 15.00KB is not." Co-developed-by: Yuntao Liu <[email protected]> Signed-off-by: Yuntao Liu <[email protected]> Fixes: 9c573cd31343 ("randomize_kstack: Improve entropy diffusion") Link: https://lore.kernel.org/r/[email protected] Reviewed-by: Arnd Bergmann <[email protected]> Acked-by: Mark Rutland <[email protected]> Acked-by: Heiko Carstens <[email protected]> # s390 Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Kees Cook <[email protected]>
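A hedged userspace illustration of the default filter mentioned above: the raw entropy word is masked to 10 bits (0x3FF), so at most 1023 bytes of extra stack offset are ever applied.

    #include <stdint.h>
    #include <stdio.h>

    int main(void)
    {
            uint32_t raw = 0xdeadbeef;      /* stand-in for the per-CPU entropy word */
            uint32_t offset = raw & 0x3FF;  /* default cap: 10 bits -> at most 1023 bytes */

            printf("stack offset used: %u bytes\n", offset);
            return 0;
    }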
2024-06-20KVM: x86: Always sync PIR to IRR prior to scanning I/O APIC routesSean Christopherson1-5/+4
Sync pending posted interrupts to the IRR prior to re-scanning I/O APIC routes, irrespective of whether the I/O APIC is emulated by userspace or by KVM. If a level-triggered interrupt routed through the I/O APIC is pending or in-service for a vCPU, KVM needs to intercept EOIs on said vCPU even if the vCPU isn't the destination for the new routing, e.g. if servicing an interrupt using the old routing races with I/O APIC reconfiguration. Commit fceb3a36c29a ("KVM: x86: ioapic: Fix level-triggered EOI and userspace I/OAPIC reconfigure race") fixed the common cases, but kvm_apic_pending_eoi() only checks if an interrupt is in the local APIC's IRR or ISR, i.e. misses the uncommon case where an interrupt is pending in the PIR. Failure to intercept EOI can manifest as guest hangs with Windows 11 if the guest uses the RTC as its timekeeping source, e.g. if the VMM doesn't expose a more modern form of time to the guest. Cc: [email protected] Cc: Adamos Ttofari <[email protected]> Cc: Raghavendra Rao Ananta <[email protected]> Reviewed-by: Jim Mattson <[email protected]> Signed-off-by: Sean Christopherson <[email protected]> Message-ID: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2024-06-20x86/kconfig: Add as-instr64 macro to properly evaluate AS_WRUSSMasahiro Yamada1-1/+1
Some instructions are only available on the 64-bit architecture. Bi-arch compilers that default to -m32 need the explicit -m64 option to evaluate them properly. Fixes: 18e66b695e78 ("x86/shstk: Add Kconfig option for shadow stack") Closes: https://lore.kernel.org/all/[email protected]/ Reported-by: Dmitry Safonov <[email protected]> Signed-off-by: Masahiro Yamada <[email protected]> Signed-off-by: Borislav Petkov (AMD) <[email protected]> Tested-by: Dmitry Safonov <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2024-06-19x86/alternatives: Make FineIBT mode Kconfig selectableKees Cook3-5/+14
Since FineIBT performs checking at the destination, it is weaker against attacks that can construct arbitrary executable memory contents. As such, some system builders want to run with FineIBT disabled by default. Allow the "cfi=kcfi" boot param mode to be selectable through Kconfig via the newly introduced CONFIG_CFI_AUTO_DEFAULT. Reviewed-by: Sami Tolvanen <[email protected]> Reviewed-by: Nathan Chancellor <[email protected]> Tested-by: Nathan Chancellor <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Kees Cook <[email protected]>
2024-06-19x86-64: word-at-a-time: improve byte count calculationsLinus Torvalds1-34/+23
This switches x86-64 over to using 'tzcount' instead of the integer multiply trick to turn the bytemask information into actual byte counts. We even had a comment saying that a fast bit count instruction is better than a multiply, but x86 bit counting has traditionally been "questionably fast", and so avoiding it was the right thing back in the days. Now, on any half-way modern core, using bit counting is cheaper and smaller than the large constant multiply, so let's just switch over. Note that as part of switching over to counting bits, we also do it at a different point. We used to create the byte count from the final byte mask, but once you use the 'tzcount' instruction (aka 'bsf' on older CPU's), you can actually count the trailing zeroes using a value we have available earlier. In fact, we can just use the very first mask of bits that tells us whether we have any zero bytes at all. The zero bytes in the word will have the high bit set, so just doing 'tzcount' on that value and dividing by 8 will give the number of bytes that precede the first NUL character, which is exactly what we want. Note also that the input value to the tzcount is by definition not zero, since that is the condition that we already used to check the whole "do we have any zero bytes at all". So we don't need to worry about the legacy instruction behavior of pre-lzcount days when 'bsf' didn't have a result for zero input. The 32-bit code continues to use the simple bit op trick that is faster even on newer cores, but particularly on the older 32-bit-only ones. Signed-off-by: Linus Torvalds <[email protected]>
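A hedged userspace illustration of the byte-count trick described above (little-endian assumed; __builtin_ctzll stands in for tzcount/bsf):

    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    static unsigned int bytes_before_nul(uint64_t word)
    {
            /* Zero bytes get their high bit set in the mask, as described above. */
            uint64_t mask = (word - 0x0101010101010101ULL) &
                            ~word & 0x8080808080808080ULL;

            if (!mask)
                    return 8;                       /* no NUL in this word */
            return __builtin_ctzll(mask) >> 3;      /* bit index -> byte count */
    }

    int main(void)
    {
            uint64_t w;

            memcpy(&w, "ab\0cdefg", 8);             /* NUL at byte offset 2 */
            printf("%u\n", bytes_before_nul(w));    /* prints 2 */
            return 0;
    }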
2024-06-19runtime constants: add x86 architecture supportLinus Torvalds2-0/+64
This implements the runtime constant infrastructure for x86, allowing the dcache d_hash() function to be generated using the hash table address as a constant, followed by a shift by a constant of the hash index. Signed-off-by: Linus Torvalds <[email protected]>
2024-06-19x86/uaccess: Improve the 8-byte getuser() caseLinus Torvalds1-49/+20
Streamline the 8-byte case and drop the special handling. Use a macro which hides the exception handling. No functional changes. Signed-off-by: Linus Torvalds <[email protected]> Signed-off-by: Borislav Petkov (AMD) <[email protected]> Link: https://lore.kernel.org/r/CAHk-=whYb2L_atsRk9pBiFiVLGe5wNZLHhRinA69yu6FiKvDsw@mail.gmail.com
2024-06-19x86/resctrl: Don't try to free nonexistent RMIDsDave Martin1-1/+2
Commit 6791e0ea3071 ("x86/resctrl: Access per-rmid structures by index") adds logic to map individual monitoring groups into a global index space used for tracking allocated RMIDs. Attempts to free the default RMID are ignored in free_rmid(), and this works fine on x86. With arm64 MPAM, there is a latent bug here however: on platforms with no monitors exposed through resctrl, each control group still gets a different monitoring group ID as seen by the hardware, since the CLOSID always forms part of the monitoring group ID. This means that when removing a control group, the code may try to free this group's default monitoring group RMID for real. If there are no monitors however, the RMID tracking table rmid_ptrs[] would be a waste of memory and is never allocated, leading to a splat when free_rmid() tries to dereference the table. One option would be to treat RMID 0 as special for every CLOSID, but this would be ugly since bookkeeping still needs to be done for these monitoring group IDs when there are monitors present in the hardware. Instead, add a gating check of resctrl_arch_mon_capable() in free_rmid(), and just do nothing if the hardware doesn't have monitors. This fix mirrors the gating checks already present in mkdir_rdt_prepare_rmid_alloc() and elsewhere. No functional change on x86. [ bp: Massage commit message. ] Fixes: 6791e0ea3071 ("x86/resctrl: Access per-rmid structures by index") Signed-off-by: Dave Martin <[email protected]> Signed-off-by: Borislav Petkov (AMD) <[email protected]> Reviewed-by: Reinette Chatre <[email protected]> Tested-by: Reinette Chatre <[email protected]> Link: https://lore.kernel.org/r/[email protected]
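A hedged sketch of the gating check described above (not the exact resctrl code, but the shape of the fix):

    void free_rmid(u32 closid, u32 rmid)
    {
            /* No monitors: rmid_ptrs[] was never allocated, so do nothing. */
            if (!resctrl_arch_mon_capable())
                    return;

            /* ... normal RMID bookkeeping continues here ... */
    }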
2024-06-19Merge drm/drm-next into drm-intel-nextJani Nikula260-3879/+5956
Sync to v6.10-rc3. Signed-off-by: Jani Nikula <[email protected]>
2024-06-18KVM: Introduce vcpu->wants_to_runDavid Matlack1-2/+2
Introduce vcpu->wants_to_run to indicate when a vCPU is in its core run loop, i.e. when the vCPU is running the KVM_RUN ioctl and immediate_exit was not set. Replace all references to vcpu->run->immediate_exit with !vcpu->wants_to_run to avoid TOCTOU races with userspace. For example, a malicious userspace could invoke KVM_RUN with immediate_exit=true and then after KVM reads it to set wants_to_run=false, flip it to false. This would result in the vCPU running in KVM_RUN with wants_to_run=false. This wouldn't cause any real bugs today but is a dangerous landmine. Signed-off-by: David Matlack <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Sean Christopherson <[email protected]>
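A hedged sketch of the latching idea, close to but not necessarily identical to the actual change: read the userspace-writable flag once and keep the result in a kernel-owned field.

    vcpu->wants_to_run = !READ_ONCE(vcpu->run->immediate_exit);
    if (vcpu->wants_to_run)
            r = vcpu_run(vcpu);         /* later checks use wants_to_run only */
    vcpu->wants_to_run = false;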
2024-06-18KVM: x86: Prevent excluding the BSP on setting max_vcpu_idsSean Christopherson1-1/+3
If the BSP vCPU ID was already set, ensure it doesn't get excluded when limiting vCPU IDs via KVM_CAP_MAX_VCPU_ID. [mks: provide commit message, code by Sean] Signed-off-by: Mathias Krause <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Sean Christopherson <[email protected]>
2024-06-18KVM: x86: Limit check IDs for KVM_SET_BOOT_CPU_IDMathias Krause1-0/+3
Do not accept IDs which are definitely invalid by limit checking the passed value against KVM_MAX_VCPU_IDS and 'max_vcpu_ids' if it was already set. This ensures invalid values, especially on 64-bit systems, don't go unnoticed and lead to a valid id by chance when truncated by the final assignment. Fixes: 73880c80aa9c ("KVM: Break dependency between vcpu index in vcpus array and vcpu_id.") Signed-off-by: Mathias Krause <[email protected]> Link: https://lore.kernel.org/r/[email protected] Co-developed-by: Sean Christopherson <[email protected]> Signed-off-by: Sean Christopherson <[email protected]>
2024-06-18Merge tag 'efi-fixes-for-v6.10-3' of git://git.kernel.org/pub/scm/linux/kernel/git/efi/efiLinus Torvalds2-2/+11
Pull EFI fixes from Ard Biesheuvel: "Another small set of EFI fixes. Only the x86 one is likely to affect any actual users (and has a cc:stable), but the issue it fixes was only observed in an unusual context (kexec in a confidential VM).
  - Ensure that EFI runtime services are not unmapped by PAN on ARM
  - Avoid freeing the memory holding the EFI memory map inadvertently on x86
  - Avoid a false positive kmemleak warning on arm64"
* tag 'efi-fixes-for-v6.10-3' of git://git.kernel.org/pub/scm/linux/kernel/git/efi/efi:
  efi/arm64: Fix kmemleak false positive in arm64_efi_rt_init()
  efi/x86: Free EFI memory map only when installing a new one.
  efi/arm: Disable LPAE PAN when calling EFI runtime services
2024-06-17x86/sev: Allow non-VMPL0 execution when an SVSM is presentTom Lendacky3-11/+27
To allow execution at a level other than VMPL0, an SVSM must be present. Allow the SEV-SNP guest to continue booting if an SVSM is detected and the hypervisor supports the SVSM feature as indicated in the GHCB hypervisor features bitmap. [ bp: Massage a bit. ] Signed-off-by: Tom Lendacky <[email protected]> Signed-off-by: Borislav Petkov (AMD) <[email protected]> Link: https://lore.kernel.org/r/2ce7cf281cce1d0cba88f3f576687ef75dc3c953.1717600736.git.thomas.lendacky@amd.com
2024-06-17x86/sev: Extend the config-fs attestation support for an SVSMTom Lendacky2-1/+80
When an SVSM is present, the guest can also request attestation reports from it. These SVSM attestation reports can be used to attest the SVSM and any services running within the SVSM. Extend the config-fs attestation support to provide such reports. This involves creating four new config-fs attributes:
  - 'service-provider' (input): determines whether the attestation request should be sent to the specified service provider or to the SEV firmware. The SVSM service provider is represented by the value 'svsm'.
  - 'service_guid' (input): used for requesting the attestation of a single service within the service provider. A null GUID implies that the SVSM_ATTEST_SERVICES call should be used to request the attestation report. A non-null GUID implies that the SVSM_ATTEST_SINGLE_SERVICE call should be used.
  - 'service_manifest_version' (input): used with the SVSM_ATTEST_SINGLE_SERVICE call, the service version represents a specific service manifest version to be used for the attestation report.
  - 'manifestblob' (output): used to return the service manifest associated with the attestation report.
Only display these new attributes when running under an SVSM. [ bp: Massage. - s/svsm_attestation_call/svsm_attest_call/g ] Signed-off-by: Tom Lendacky <[email protected]> Signed-off-by: Borislav Petkov (AMD) <[email protected]> Link: https://lore.kernel.org/r/965015dce3c76bb8724839d50c5dea4e4b5d598f.1717600736.git.thomas.lendacky@amd.com
2024-06-17virt: sev-guest: Choose the VMPCK key based on executing VMPLTom Lendacky2-3/+11
Currently, the sev-guest driver uses the vmpck-0 key by default. When an SVSM is present, the kernel is running at a VMPL other than 0 and the vmpck-0 key is no longer available. If a specific vmpck key has not been requested by the user via the vmpck_id module parameter, choose the vmpck key based on the active VMPL level. Signed-off-by: Tom Lendacky <[email protected]> Signed-off-by: Borislav Petkov (AMD) <[email protected]> Link: https://lore.kernel.org/r/b88081c5d88263176849df8ea93e90a404619cab.1717600736.git.thomas.lendacky@amd.com
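A hedged sketch of the default-key selection described above (variable names are assumptions, e.g. snp_vmpl standing in for the running VMPL):

    /* vmpck_id < 0 means "not specified via the module parameter" */
    if (vmpck_id < 0)
            vmpck_id = snp_vmpl;    /* use the VMPCK matching the running VMPL */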
2024-06-17x86/sev: Provide guest VMPL level to userspaceTom Lendacky1-0/+46
Requesting an attestation report from userspace involves providing the VMPL level for the report. Currently any value from 0-3 is valid because Linux enforces running at VMPL0. When an SVSM is present, though, Linux will not be running at VMPL0 and only VMPL values from the level Linux is running at up to 3 are valid. In order to allow userspace to determine the minimum VMPL value that can be supplied to an attestation report, create a sysfs entry that can be used to retrieve the current VMPL level of the kernel. [ bp: Add CONFIG_SYSFS ifdeffery. ] Signed-off-by: Tom Lendacky <[email protected]> Signed-off-by: Borislav Petkov (AMD) <[email protected]> Link: https://lore.kernel.org/r/fff846da0d8d561f9fdaf297dcf8cd907545a25b.1717600736.git.thomas.lendacky@amd.com
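A hedged sketch of what such a sysfs read hook could look like (attribute and variable names are illustrative, not necessarily those used in the patch):

    static ssize_t vmpl_show(struct kobject *kobj,
                             struct kobj_attribute *attr, char *buf)
    {
            return sysfs_emit(buf, "%d\n", snp_vmpl);
    }
    static struct kobj_attribute vmpl_attr = __ATTR_RO(vmpl);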
2024-06-17x86/sev: Provide SVSM discovery supportTom Lendacky4-0/+25
The SVSM specification documents an alternative method of discovery for the SVSM using a reserved CPUID bit and a reserved MSR. This is intended for guest components that do not have access to the secrets page in order to be able to call the SVSM (e.g. UEFI runtime services). For the MSR support, a new reserved MSR 0xc001f000 has been defined. A #VC should be generated when accessing this MSR. The #VC handler is expected to ignore writes to this MSR and return the physical calling area address (CAA) on reads of this MSR. While the CPUID leaf is updated, allowing the creation of a CPU feature, the code will continue to use the VMPL level as an indication of the presence of an SVSM. This is because the SVSM can be called well before the CPU feature is in place and a non-zero VMPL requires that an SVSM be present. Signed-off-by: Tom Lendacky <[email protected]> Signed-off-by: Borislav Petkov (AMD) <[email protected]> Link: https://lore.kernel.org/r/4f93f10a2ff3e9f368fd64a5920d51bf38d0c19e.1717600736.git.thomas.lendacky@amd.com
2024-06-17x86/sev: Use the SVSM to create a vCPU when not in VMPL0Tom Lendacky2-20/+56
Using the RMPADJUST instruction, the VMSA attribute can only be changed at VMPL0. An SVSM will be present when running at VMPL1 or a lower privilege level. In that case, use the SVSM_CORE_CREATE_VCPU call or the SVSM_CORE_DESTROY_VCPU call to perform VMSA attribute changes. Use the VMPL level supplied by the SVSM for the VMSA when starting the AP. [ bp: Fix typo + touchups. ] Signed-off-by: Tom Lendacky <[email protected]> Signed-off-by: Borislav Petkov (AMD) <[email protected]> Link: https://lore.kernel.org/r/bcdd95ecabe9723673b9693c7f1533a2b8f17781.1717600736.git.thomas.lendacky@amd.com
2024-06-17x86/sev: Perform PVALIDATE using the SVSM when not at VMPL0Tom Lendacky4-24/+328
The PVALIDATE instruction can only be performed at VMPL0. If an SVSM is present, it will be running at VMPL0 while the guest itself is then running at VMPL1 or a lower privilege level. In that case, use the SVSM_CORE_PVALIDATE call to perform memory validation instead of issuing the PVALIDATE instruction directly. The validation of a single 4K page is now explicitly identified as such in the function name, pvalidate_4k_page(). The pvalidate_pages() function is used for validating 1 or more pages at either 4K or 2M in size. Each function, however, determines whether it can issue the PVALIDATE directly or whether the SVSM needs to be invoked. [ bp: Touchups. ] [ Tom: fold in a fix for Coconut SVSM: https://lore.kernel.org/r/[email protected] ] Signed-off-by: Tom Lendacky <[email protected]> Signed-off-by: Borislav Petkov (AMD) <[email protected]> Link: https://lore.kernel.org/r/4c4017d8b94512d565de9ccb555b1a9f8983c69c.1717600736.git.thomas.lendacky@amd.com
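A hedged dispatch sketch of the behaviour described above (error handling omitted; the SVSM helper name is an assumption, while pvalidate() and RMP_PG_SIZE_4K are existing kernel symbols):

    static void pvalidate_4k_page(unsigned long vaddr, unsigned long paddr,
                                  bool validate)
    {
            if (snp_vmpl)
                    /* Not at VMPL0: ask the SVSM via SVSM_CORE_PVALIDATE. */
                    svsm_pvalidate_4k_page(paddr, validate);
            else
                    /* At VMPL0: issue PVALIDATE directly. */
                    pvalidate(vaddr, RMP_PG_SIZE_4K, validate);
    }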
2024-06-17perf/x86/intel/uncore: Support HBM and CXL PMON countersKan Liang1-2/+53
Unknown uncore PMON types can be found in both SPR and EMR with HBM or CXL:
  $ ls /sys/devices/ | grep type
  uncore_type_12_16
  uncore_type_12_18
  uncore_type_12_2
  uncore_type_12_4
  uncore_type_12_6
  uncore_type_12_8
  uncore_type_13_17
  uncore_type_13_19
  uncore_type_13_3
  uncore_type_13_5
  uncore_type_13_7
  uncore_type_13_9
The unknown PMON types are HBM and CXL PMON. Except for the name, the other information regarding the HBM and CXL PMON counters can be retrieved via the discovery table. Add them into the uncores tables for SPR and EMR. The event config registers for all CXL related units are 8 bytes apart. Add SPR_UNCORE_MMIO_OFFS8_COMMON_FORMAT to specially handle it. Signed-off-by: Kan Liang <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Tested-by: Yunying Sun <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2024-06-17perf/x86/uncore: Cleanup unused unit structureKan Liang4-112/+12
The unit control and ID information are retrieved from the unit control RB tree. No one uses the old structure anymore. Remove them. Signed-off-by: Kan Liang <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Tested-by: Yunying Sun <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2024-06-17perf/x86/uncore: Apply the unit control RB tree to PCI uncore unitsKan Liang5-48/+94
The unit control RB tree has the unit control and unit ID information for all the PCI units. Use them to replace the box_ctls/pci_offsets to get an accurate unit control address for PCI uncore units. The UPI/M3UPI units in the discovery table are ignored. Please see the commit 65248a9a9ee1 ("perf/x86/uncore: Add a quirk for UPI on SPR"). Manually allocate a unit control RB tree for UPI/M3UPI. Add cleanup_extra_boxes to release such manual allocation. Signed-off-by: Kan Liang <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Tested-by: Yunying Sun <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2024-06-17perf/x86/uncore: Apply the unit control RB tree to MSR uncore unitsKan Liang4-11/+59
The unit control RB tree has the unit control and unit ID information for all the MSR units. Use them to replace the box_ctl and uncore_msr_box_ctl() to get an accurate unit control address for MSR uncore units. Add intel_generic_uncore_assign_hw_event(), which utilizes the accurate unit control address from the unit control RB tree to calculate the config_base and event_base. The unit id related information should be retrieved from the unit control RB tree as well. Signed-off-by: Kan Liang <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Tested-by: Yunying Sun <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2024-06-17perf/x86/uncore: Apply the unit control RB tree to MMIO uncore unitsKan Liang1-16/+14
The unit control RB tree has the unit control and unit ID information for all the units. Use it to replace the box_ctls/mmio_offsets to get an accurate unit control address for MMIO uncore units. Signed-off-by: Kan Liang <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Tested-by: Yunying Sun <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2024-06-17perf/x86/uncore: Retrieve the unit ID from the unit control RB treeKan Liang1-0/+3
The box_ids only save the unit ID for the first die. If a unit, e.g., a CXL unit, doesn't exist in the first die, its unit ID cannot be retrieved. The unit control RB tree also stores the unit ID information. Retrieve the unit ID from the unit control RB tree instead. Signed-off-by: Kan Liang <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Tested-by: Yunying Sun <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2024-06-17perf/x86/uncore: Support per PMU cpumaskKan Liang4-5/+89
The cpumask of some uncore units, e.g., CXL uncore units, may be wrong under some configurations. Perf may access an uncore counter of a non-existent uncore unit. The uncore driver assumes that all uncore units are symmetric among dies. A global cpumask is shared among all uncore PMUs. However, some CXL uncore units may only be available on some dies. A per PMU cpumask is introduced to track the CPU mask of this PMU. The driver searches the unit control RB tree to check whether the PMU is available on a given die, and updates the per PMU cpumask accordingly. Signed-off-by: Kan Liang <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Tested-by: Yunying Sun <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2024-06-17perf/x86/uncore: Save the unit control address of all unitsKan Liang2-2/+87
The unit control address of some CXL units may be wrongly calculated under some configurations on an EMR machine. The current implementation only saves the unit control address of the units from the first die, and the first unit of the rest of the dies. Perf assumed that the units from the other dies have the same offset as the first die. So the unit control address of the rest of the units can be calculated. However, the assumption is wrong, especially for the CXL units. Introduce an RB tree for each uncore type to save the unit control address and three kinds of ID information (unit ID, PMU ID, and die ID) for all units. The unit ID is a physical ID of a unit. The PMU ID is a logical ID assigned to a unit. The logical IDs start from 0 and must be contiguous. The physical ID and the logical ID are a 1:1 mapping. The units with the same physical ID in different dies share the same PMU. The die ID indicates which die a unit belongs to. The RB tree can be searched by two different keys (unit ID or PMU ID + die ID). During the RB tree setup, the unit ID is used as a key to look up the RB tree. The perf can create/assign a proper PMU ID to the unit. Later, after the RB tree is set up, PMU ID + die ID is used as a key to look up the RB tree to fill the cpumask of a PMU. It's used more frequently, so PMU ID + die ID is compared in the unit_less(). The uncore_find_unit() has to be O(N). But the RB tree setup only occurs once during the driver load time. It should be acceptable. Compared with the current implementation, more space is required to save the information of all units. The extra size should be acceptable. For example, on EMR, there are 221 units at most. For a 2-socket machine, the extra space is ~6KB at most. Signed-off-by: Kan Liang <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Link: https://lore.kernel.org/r/[email protected]
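A hedged sketch of the two-part lookup key described above (the struct and helper are illustrative, not the actual perf uncore types):

    struct uncore_unit {                    /* illustrative only */
            struct rb_node node;
            unsigned int id;                /* physical unit ID */
            unsigned int pmu_idx;           /* logical PMU ID, contiguous from 0 */
            unsigned int die;
    };

    /* Order by (PMU ID, die ID), the key used for the frequent lookups. */
    static bool unit_less(const struct uncore_unit *a, const struct uncore_unit *b)
    {
            if (a->pmu_idx != b->pmu_idx)
                    return a->pmu_idx < b->pmu_idx;
            return a->die < b->die;
    }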
2024-06-17x86/acpi: Add support for CPU offlining for ACPI MADT wakeup methodKirill A. Shutemov4-3/+213
MADT Multiprocessor Wakeup structure version 1 brings support for CPU offlining: BIOS provides a reset vector where the CPU has to jump to for offlining itself. The new TEST mailbox command can be used to test whether the CPU offlined itself which means the BIOS has control over the CPU and can online it again via the ACPI MADT wakeup method. Add CPU offlining support for the ACPI MADT wakeup method by implementing custom cpu_die(), play_dead() and stop_this_cpu() SMP operations. CPU offlining makes it possible to hand over secondary CPUs over kexec, not limiting the second kernel to a single CPU. The change conforms to the approved ACPI spec change proposal. See the Link. Signed-off-by: Kirill A. Shutemov <[email protected]> Signed-off-by: Borislav Petkov (AMD) <[email protected]> Reviewed-by: Kuppuswamy Sathyanarayanan <[email protected]> Reviewed-by: Thomas Gleixner <[email protected]> Acked-by: Kai Huang <[email protected]> Acked-by: Rafael J. Wysocki <[email protected]> Tested-by: Tao Liu <[email protected]> Link: https://lore.kernel.org/all/13356251.uLZWGnKmhe@kreacher Link: https://lore.kernel.org/r/[email protected]
2024-06-17x86/mm: Introduce kernel_ident_mapping_free()Kirill A. Shutemov2-0/+76
The helper complements kernel_ident_mapping_init(): it frees the identity mapping that was previously allocated. It will be used in the error path to free a partially allocated mapping or if the mapping is no longer needed. The caller provides a struct x86_mapping_info with the free_pgd_page() callback hooked up and the pgd_t to free. Signed-off-by: Kirill A. Shutemov <[email protected]> Signed-off-by: Borislav Petkov (AMD) <[email protected]> Acked-by: Kai Huang <[email protected]> Tested-by: Tao Liu <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2024-06-17x86/smp: Add smp_ops.stop_this_cpu() callbackKirill A. Shutemov3-0/+14
If the helper is defined, it is called instead of halt() to stop the CPU at the end of stop_this_cpu() and on crash CPU shutdown. ACPI MADT will use it to hand over the CPU to BIOS in order to be able to wake it up again after kexec. Signed-off-by: Kirill A. Shutemov <[email protected]> Signed-off-by: Borislav Petkov (AMD) <[email protected]> Reviewed-by: Thomas Gleixner <[email protected]> Acked-by: Kai Huang <[email protected]> Tested-by: Tao Liu <[email protected]> Link: https://lore.kernel.org/r/[email protected]
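A hedged sketch of the fallback pattern described above (simplified from what stop_this_cpu() ends up doing):

    if (smp_ops.stop_this_cpu) {
            smp_ops.stop_this_cpu();
            unreachable();          /* the callback must not return */
    }

    for (;;)
            native_halt();          /* legacy behaviour: just halt */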
2024-06-17x86/acpi: Do not attempt to bring up secondary CPUs in the kexec caseKirill A. Shutemov1-1/+28
ACPI MADT doesn't allow offlining a CPU after it has been onlined. This limits kexec: the second kernel won't be able to use more than one CPU. To prevent a kexec kernel from onlining secondary CPUs, invalidate the mailbox address in the ACPI MADT wakeup structure which prevents a kexec kernel from using it. This is safe as the booting kernel has the mailbox address cached already and acpi_wakeup_cpu() uses the cached value to bring up the secondary CPUs. Note: This is a Linux specific convention and not covered by the ACPI specification. Signed-off-by: Kirill A. Shutemov <[email protected]> Signed-off-by: Borislav Petkov (AMD) <[email protected]> Reviewed-by: Kai Huang <[email protected]> Reviewed-by: Kuppuswamy Sathyanarayanan <[email protected]> Reviewed-by: Thomas Gleixner <[email protected]> Acked-by: Rafael J. Wysocki <[email protected]> Tested-by: Tao Liu <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2024-06-17x86/acpi: Rename fields in the acpi_madt_multiproc_wakeup structureKirill A. Shutemov1-1/+1
In order to support MADT wakeup structure version 1, provide more appropriate names for the fields in the structure. Rename 'mailbox_version' to 'version'. This field signifies the version of the structure and the related protocols, rather than the version of the mailbox. This field has not been utilized in the code thus far. Rename 'base_address' to 'mailbox_address' to clarify the kind of address it represents. In version 1, the structure includes the reset vector address. Clear and distinct naming helps to prevent any confusion. Signed-off-by: Kirill A. Shutemov <[email protected]> Signed-off-by: Borislav Petkov (AMD) <[email protected]> Reviewed-by: Kai Huang <[email protected]> Reviewed-by: Kuppuswamy Sathyanarayanan <[email protected]> Reviewed-by: Thomas Gleixner <[email protected]> Acked-by: Rafael J. Wysocki <[email protected]> Tested-by: Tao Liu <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2024-06-17x86/mm: Do not zap page table entries mapping unaccepted memory table during kdumpAshish Kalra1-4/+12
During crashkernel boot only pre-allocated crash memory is presented as E820_TYPE_RAM. This can cause page table entries mapping the unaccepted memory table to be zapped during phys_pte_init(), phys_pmd_init(), phys_pud_init() and phys_p4d_init() as SNP/TDX guests use E820_TYPE_ACPI to store the unaccepted memory table and pass it between the kernels on kexec/kdump. E820_TYPE_ACPI covers not only ACPI data, but also EFI tables and might be required by the kernel to function properly. The problem was discovered during debugging kdump for SNP guest. The unaccepted memory table stored with E820_TYPE_ACPI and passed between the kernels on kdump was getting zapped because the PMD entry mapping it lies above the E820_TYPE_RAM range for the reserved crashkernel memory. Signed-off-by: Ashish Kalra <[email protected]> Signed-off-by: Kirill A. Shutemov <[email protected]> Signed-off-by: Borislav Petkov (AMD) <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2024-06-17x86/mm: Make e820__end_ram_pfn() cover E820_TYPE_ACPI rangesKirill A. Shutemov1-4/+5
e820__end_of_ram_pfn() is used to calculate max_pfn which, among other things, guides where direct mapping ends. Any memory above max_pfn is not going to be present in the direct mapping. e820__end_of_ram_pfn() finds the end of the RAM based on the highest E820_TYPE_RAM range. But it doesn't include E820_TYPE_ACPI ranges in the calculation. Despite the name, E820_TYPE_ACPI covers not only ACPI data, but also EFI tables and might be required by the kernel to function properly. Usually the problem is hidden because there is some E820_TYPE_RAM memory above E820_TYPE_ACPI. But crashkernel only presents pre-allocated crash memory as E820_TYPE_RAM on boot. If the pre-allocated range is small, it can fit under the last E820_TYPE_ACPI range. Modify e820__end_of_ram_pfn() and e820__end_of_low_ram_pfn() to cover E820_TYPE_ACPI memory. The problem was discovered during debugging kexec for TDX guest. TDX guest uses E820_TYPE_ACPI to store the unaccepted memory bitmap and pass it between the kernels on kexec. Signed-off-by: Kirill A. Shutemov <[email protected]> Signed-off-by: Borislav Petkov (AMD) <[email protected]> Reviewed-by: Dave Hansen <[email protected]> Tested-by: Tao Liu <[email protected]> Link: https://lore.kernel.org/r/[email protected]
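A hedged sketch of the widened range check described above (simplified; the real helpers iterate the full e820 table):

    /* Count both RAM and "ACPI data" ranges towards the end-of-RAM pfn. */
    if (entry->type != E820_TYPE_RAM && entry->type != E820_TYPE_ACPI)
            continue;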
2024-06-17x86/tdx: Convert shared memory back to private on kexecKirill A. Shutemov4-3/+141
TDX guests allocate shared buffers to perform I/O. It is done by allocating pages normally from the buddy allocator and converting them to shared with set_memory_decrypted(). The second, kexec-ed kernel has no idea what memory is converted this way. It only sees E820_TYPE_RAM. Accessing shared memory via private mapping is fatal. It leads to unrecoverable TD exit. On kexec, walk direct mapping and convert all shared memory back to private. It makes all RAM private again and second kernel may use it normally. The conversion occurs in two steps: stopping new conversions and unsharing all memory. In the case of normal kexec, the stopping of conversions takes place while scheduling is still functioning. This allows for waiting until any ongoing conversions are finished. The second step is carried out when all CPUs except one are inactive and interrupts are disabled. This prevents any conflicts with code that may access shared memory. Signed-off-by: Kirill A. Shutemov <[email protected]> Signed-off-by: Borislav Petkov (AMD) <[email protected]> Reviewed-by: Rick Edgecombe <[email protected]> Reviewed-by: Kai Huang <[email protected]> Tested-by: Tao Liu <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2024-06-17x86/mm: Add callbacks to prepare encrypted memory for kexecKirill A. Shutemov4-0/+38
AMD SEV and Intel TDX guests allocate shared buffers for performing I/O. This is done by allocating pages normally from the buddy allocator and then converting them to shared using set_memory_decrypted(). On kexec, the second kernel is unaware of which memory has been converted in this manner. It only sees E820_TYPE_RAM. Accessing shared memory as private is fatal. Therefore, the memory state must be reset to its original state before starting the new kernel with kexec. The process of converting shared memory back to private occurs in two steps: - enc_kexec_begin() stops new conversions. - enc_kexec_finish() unshares all existing shared memory, reverting it back to private. Signed-off-by: Kirill A. Shutemov <[email protected]> Signed-off-by: Borislav Petkov (AMD) <[email protected]> Reviewed-by: Nikolay Borisov <[email protected]> Reviewed-by: Kai Huang <[email protected]> Tested-by: Tao Liu <[email protected]> Link: https://lore.kernel.org/r/[email protected]
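A hedged sketch of where the two callbacks slot into the kexec path (the callback names are those introduced here; the surrounding code is simplified):

    /* Step 1: scheduling still works, so stop new private<->shared
     * conversions and wait for in-flight ones to finish. */
    if (x86_platform.guest.enc_kexec_begin)
            x86_platform.guest.enc_kexec_begin();

    /* Step 2: all other CPUs are stopped and interrupts are disabled;
     * walk the direct mapping and convert everything back to private. */
    if (x86_platform.guest.enc_kexec_finish)
            x86_platform.guest.enc_kexec_finish();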
2024-06-17x86/tdx: Account shared memoryKirill A. Shutemov1-0/+7
The kernel will convert all shared memory back to private during kexec. The direct mapping page tables will provide information on which memory is shared. It is extremely important to convert all shared memory. If a page is missed, it will cause the second kernel to crash when it accesses it. Keep track of the number of shared pages. This will allow for cross-checking against the shared information in the direct mapping and reporting if the shared bit is lost. Memory conversion is slow and does not happen often. A global atomic is not going to be a bottleneck. Signed-off-by: Kirill A. Shutemov <[email protected]> Signed-off-by: Borislav Petkov (AMD) <[email protected]> Reviewed-by: Kai Huang <[email protected]> Tested-by: Tao Liu <[email protected]> Link: https://lore.kernel.org/r/[email protected]
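A hedged sketch of the accounting: a single global atomic counter bumped on every conversion (the variable name is an assumption):

    static atomic_long_t nr_shared;

    /* after pages were successfully converted to shared */
    atomic_long_add(nr_pages, &nr_shared);

    /* after pages were successfully converted back to private */
    atomic_long_sub(nr_pages, &nr_shared);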
2024-06-17x86/mm: Return correct level from lookup_address() if pte is noneKirill A. Shutemov2-11/+11
Currently, lookup_address() returns two things: 1. A "pte_t" (which might be a p[g4um]d_t) 2. The 'level' of the page tables where the "pte_t" was found (returned via a pointer) If no pte_t is found, 'level' is essentially garbage. Always fill out the level. For NULL "pte_t"s, fill in the level where the p*d_none() entry was found mirroring the "found" behavior. Always filling out the level allows using lookup_address() to precisely skip over holes when walking kernel page tables. Add one more entry into enum pg_level to indicate the size of the VA covered by one PGD entry in 5-level paging mode. Update comments for lookup_address() and lookup_address_in_pgd() to reflect changes in the interface. Signed-off-by: Kirill A. Shutemov <[email protected]> Signed-off-by: Borislav Petkov (AMD) <[email protected]> Reviewed-by: Rick Edgecombe <[email protected]> Reviewed-by: Baoquan He <[email protected]> Reviewed-by: Dave Hansen <[email protected]> Tested-by: Tao Liu <[email protected]> Link: https://lore.kernel.org/r/[email protected]
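A hedged sketch of the hole-skipping that an always-valid level enables (illustrative walker fragment; lookup_address(), pte_none(), round_up() and page_level_size() are existing kernel helpers):

    pte = lookup_address(vaddr, &level);
    if (!pte || pte_none(*pte)) {
            /* 'level' is valid even for holes, so jump over the whole hole. */
            vaddr = round_up(vaddr + 1, page_level_size(level));
            continue;
    }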
2024-06-17x86/mm: Make x86_platform.guest.enc_status_change_*() return an errorKirill A. Shutemov6-34/+36
TDX is going to have more than one reason to fail enc_status_change_prepare(). Change the callback to return errno instead of assuming -EIO. Change enc_status_change_finish() too to keep the interface symmetric. Signed-off-by: Kirill A. Shutemov <[email protected]> Signed-off-by: Borislav Petkov (AMD) <[email protected]> Reviewed-by: Dave Hansen <[email protected]> Reviewed-by: Kai Huang <[email protected]> Reviewed-by: Michael Kelley <[email protected]> Tested-by: Tao Liu <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2024-06-17x86/kexec: Keep CR4.MCE set during kexec for TDX guestKirill A. Shutemov1-7/+10
TDX guests run with MCA enabled (CR4.MCE=1b) from the very start. If that bit is cleared during CR4 register reprogramming during boot or kexec flows, a #VE exception will be raised which the guest kernel cannot handle. Therefore, make sure the CR4.MCE setting is preserved over kexec too and avoid raising any #VEs. Signed-off-by: Kirill A. Shutemov <[email protected]> Signed-off-by: Borislav Petkov (AMD) <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2024-06-17x86/relocate_kernel: Use named labels for less confusionBorislav Petkov1-6/+7
That identity_mapped() function was loving that "1" label to the point of completely confusing its readers. Use named labels in each place for clarity. No functional changes. Signed-off-by: Borislav Petkov (AMD) <[email protected]> Signed-off-by: Kirill A. Shutemov <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2024-06-17cpu/hotplug, x86/acpi: Disable CPU offlining for ACPI MADT wakeupKirill A. Shutemov2-1/+3
ACPI MADT doesn't allow offlining a CPU after it has been woken up. Currently, CPU hotplug is prevented based on the confidential computing attribute which is set for Intel TDX. But TDX is not the only possible user of the wake up method. Any platform that uses the ACPI MADT wakeup method cannot offline CPUs. Disable CPU offlining on ACPI MADT wakeup enumeration. This has no visible effects for users: currently, TDX guest is the only platform that uses the ACPI MADT wakeup method. Signed-off-by: Kirill A. Shutemov <[email protected]> Signed-off-by: Borislav Petkov (AMD) <[email protected]> Reviewed-by: Thomas Gleixner <[email protected]> Acked-by: Rafael J. Wysocki <[email protected]> Tested-by: Tao Liu <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2024-06-17x86/apic: Mark acpi_mp_wake_* variables as __ro_after_initKirill A. Shutemov1-2/+2
acpi_mp_wake_mailbox_paddr and acpi_mp_wake_mailbox are initialized once during ACPI MADT init and never changed. Signed-off-by: Kirill A. Shutemov <[email protected]> Signed-off-by: Borislav Petkov (AMD) <[email protected]> Reviewed-by: Baoquan He <[email protected]> Reviewed-by: Thomas Gleixner <[email protected]> Acked-by: Kai Huang <[email protected]> Acked-by: Rafael J. Wysocki <[email protected]> Tested-by: Tao Liu <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2024-06-17x86/acpi: Extract ACPI MADT wakeup code into a separate fileKirill A. Shutemov5-85/+96
In order to prepare for the expansion of support for the ACPI MADT wakeup method, move the relevant code into a separate file. Introduce a new configuration option to clearly indicate dependencies without the use of ifdefs. There have been no functional changes. Signed-off-by: Kirill A. Shutemov <[email protected]> Signed-off-by: Borislav Petkov (AMD) <[email protected]> Reviewed-by: Baoquan He <[email protected]> Reviewed-by: Kuppuswamy Sathyanarayanan <[email protected]> Reviewed-by: Thomas Gleixner <[email protected]> Acked-by: Borislav Petkov (AMD) <[email protected]> Acked-by: Kai Huang <[email protected]> Acked-by: Rafael J. Wysocki <[email protected]> Tested-by: Tao Liu <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2024-06-17x86/kexec: Remove spurious unconditional JMP from identity_mapped()Nikolay Borisov1-3/+0
This seemingly straightforward JMP was introduced in the initial version of the 64-bit kexec code without any explanation. It turns out (check accompanying Link) it's likely a copy/paste artefact from 32-bit code, where such a JMP could be used as a serializing instruction for the 486's prefetch queue. On x86_64 that's not needed because there's already a preceding write to cr4 which itself is a serializing operation. [ bp: Typos. Let's try this and see what cries out. If it does, reverting it is trivial. ] Signed-off-by: Nikolay Borisov <[email protected]> Signed-off-by: Borislav Petkov (AMD) <[email protected]> Link: https://lore.kernel.org/all/[email protected]/