Rename REMOVED_SPTE to FROZEN_SPTE so that it can be used for other
multi-part operations.
REMOVED_SPTE is used as a non-present intermediate value for multi-part
operations that can happen when a thread doesn't have an MMU write lock.
Today, these operations are used only when removing PTEs.
However, future changes will want to use the same concept for setting a
PTE. In that case the REMOVED_SPTE name does not quite fit. So rename it
to FROZEN_SPTE so it can be used for both types of operations.
Also rename the relevant helpers and comments that refer to "removed"
within the context of the SPTE value. Take care not to update naming
referring to the "remove" operations, which are still distinct.
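As a rough sketch, the renamed definitions would look something like this
(the exact bit pattern is illustrative):
  /* Non-present value marking an SPTE that is mid-update */
  #define FROZEN_SPTE     (SHADOW_NONPRESENT_VALUE | 0x5a0ULL)

  static inline bool is_frozen_spte(u64 spte)
  {
          return spte == FROZEN_SPTE;
  }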
Suggested-by: Paolo Bonzini <[email protected]>
Signed-off-by: Rick Edgecombe <[email protected]>
Message-ID: <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
|
|
|
|
The TDP MMU function __tdp_mmu_set_spte_atomic uses a cmpxchg64 to replace
the SPTE value and returns -EBUSY on failure. The caller must check the
return value and retry. Add __must_check to it, as well as to two more
functions that forward the return value of __tdp_mmu_set_spte_atomic to
their caller.
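Illustrative prototype (argument list abbreviated from the actual code):
  /* __must_check makes ignoring the -EBUSY return a build warning */
  static int __must_check __tdp_mmu_set_spte_atomic(struct tdp_iter *iter,
                                                    u64 new_spte);
  /* likewise for the two callers that forward its return value */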
Signed-off-by: Isaku Yamahata <[email protected]>
Reviewed-by: Binbin Wu <[email protected]>
Message-Id: <8f7d5a1b241bf5351eaab828d1a1efe5c17699ca.1705965635.git.isaku.yamahata@intel.com>
Acked-by: Kai Huang <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
|
|
I'm getting tired of telling people to put a magic "" in the
#define X86_FEATURE /* "" ... */
comment to hide the new feature flag from the user-visible
/proc/cpuinfo.
Flip the logic to make it explicit: an explicit "<name>" in the comment
adds the flag to /proc/cpuinfo and otherwise not, by default.
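For illustration, with the flipped logic the convention looks roughly like
this (flag names and bit positions are hypothetical):
  /* no quoted name in the comment: hidden from /proc/cpuinfo */
  #define X86_FEATURE_FOO         (18*32+ 4) /* Some kernel-internal feature */

  /* quoted "bar" in the comment: shown as "bar" in /proc/cpuinfo */
  #define X86_FEATURE_BAR         (18*32+ 5) /* "bar" Some user-visible feature */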
Add the "<name>" of all the existing flags to keep backwards
compatibility with userspace.
There should be no functional changes resulting from this.
Acked-by: Dave Hansen <[email protected]>
Signed-off-by: Borislav Petkov (AMD) <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
|
|
An unintended consequence of commit 9c573cd31343 ("randomize_kstack:
Improve entropy diffusion") was that the per-architecture entropy size
filtering reduced how many bits were being added to the mix, rather than
how many bits were being used during the offsetting. All architectures
fell back to the existing default of 0x3FF (10 bits), which will consume
at most 1KiB of stack space. It seems that this is working just fine,
so let's avoid the confusion and update everything to use the default.
The prior intent of the per-architecture limits was:
arm64: capped at 0x1FF (9 bits), 5 bits effective
powerpc: uncapped (10 bits), 6 or 7 bits effective
riscv: uncapped (10 bits), 6 bits effective
x86: capped at 0xFF (8 bits), 5 (x86_64) or 6 (ia32) bits effective
s390: capped at 0xFF (8 bits), undocumented effective entropy
Current discussion has led to just dropping the original per-architecture
filters. The additional entropy appears to be safe for arm64, x86,
and s390. Quoting Arnd, "There is no point pretending that 15.75KB is
somehow safe to use while 15.00KB is not."
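For reference, the default cap all architectures now rely on is the existing
generic one, roughly:
  /* default limit: at most 10 bits of offset, i.e. up to 1KiB of stack */
  #define KSTACK_OFFSET_MAX(x)    ((x) & 0x3FF)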
Co-developed-by: Yuntao Liu <[email protected]>
Signed-off-by: Yuntao Liu <[email protected]>
Fixes: 9c573cd31343 ("randomize_kstack: Improve entropy diffusion")
Link: https://lore.kernel.org/r/[email protected]
Reviewed-by: Arnd Bergmann <[email protected]>
Acked-by: Mark Rutland <[email protected]>
Acked-by: Heiko Carstens <[email protected]> # s390
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Kees Cook <[email protected]>
|
|
Sync pending posted interrupts to the IRR prior to re-scanning I/O APIC
routes, irrespective of whether the I/O APIC is emulated by userspace or
by KVM. If a level-triggered interrupt routed through the I/O APIC is
pending or in-service for a vCPU, KVM needs to intercept EOIs on said
vCPU even if the vCPU isn't the destination for the new routing, e.g. if
servicing an interrupt using the old routing races with I/O APIC
reconfiguration.
Commit fceb3a36c29a ("KVM: x86: ioapic: Fix level-triggered EOI and
userspace I/OAPIC reconfigure race") fixed the common cases, but
kvm_apic_pending_eoi() only checks if an interrupt is in the local
APIC's IRR or ISR, i.e. misses the uncommon case where an interrupt is
pending in the PIR.
Failure to intercept EOI can manifest as guest hangs with Windows 11 if
the guest uses the RTC as its timekeeping source, e.g. if the VMM doesn't
expose a more modern form of time to the guest.
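A hedged sketch of the idea (the exact call site in the rescan path is not
shown here):
  /* Make PIR-pending vectors visible in the IRR so pending/in-service
   * level-triggered interrupts are accounted for before the EOI-intercept
   * bitmap is recomputed for the new I/O APIC routing. */
  if (kvm_x86_ops.sync_pir_to_irr)
          kvm_x86_ops.sync_pir_to_irr(vcpu);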
Cc: [email protected]
Cc: Adamos Ttofari <[email protected]>
Cc: Raghavendra Rao Ananta <[email protected]>
Reviewed-by: Jim Mattson <[email protected]>
Signed-off-by: Sean Christopherson <[email protected]>
Message-ID: <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
|
|
Some instructions are only available on the 64-bit architecture.
Bi-arch compilers that default to -m32 need the explicit -m64 option
to evaluate them properly.
Fixes: 18e66b695e78 ("x86/shstk: Add Kconfig option for shadow stack")
Closes: https://lore.kernel.org/all/[email protected]/
Reported-by: Dmitry Safonov <[email protected]>
Signed-off-by: Masahiro Yamada <[email protected]>
Signed-off-by: Borislav Petkov (AMD) <[email protected]>
Tested-by: Dmitry Safonov <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
|
|
Since FineIBT performs checking at the destination, it is weaker against
attacks that can construct arbitrary executable memory contents. As such,
some system builders want to run with FineIBT disabled by default. Allow
the "cfi=kcfi" boot param mode to be selectable through Kconfig via the
newly introduced CONFIG_CFI_AUTO_DEFAULT.
Reviewed-by: Sami Tolvanen <[email protected]>
Reviewed-by: Nathan Chancellor <[email protected]>
Tested-by: Nathan Chancellor <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Kees Cook <[email protected]>
|
|
This switches x86-64 over to using 'tzcount' instead of the integer
multiply trick to turn the bytemask information into actual byte counts.
We even had a comment saying that a fast bit count instruction is better
than a multiply, but x86 bit counting has traditionally been
"questionably fast", and so avoiding it was the right thing back in the
days.
Now, on any half-way modern core, using bit counting is cheaper and
smaller than the large constant multiply, so let's just switch over.
Note that as part of switching over to counting bits, we also do it at a
different point. We used to create the byte count from the final byte
mask, but once you use the 'tzcount' instruction (aka 'bsf' on older
CPU's), you can actually count the leading zeroes using a value we have
available earlier.
In fact, we can just use the very first mask of bits that tells us
whether we have any zero bytes at all. The zero bytes in the word will
have the high bit set, so just doing 'tzcount' on that value and
dividing by 8 will give the number of bytes that precede the first NUL
character, which is exactly what we want.
Note also that the input value to the tzcount is by definition not zero,
since that is the condition that we already used to check the whole "do
we have any zero bytes at all". So we don't need to worry about the
legacy instruction behavior of pre-lzcount days when 'bsf' didn't have a
result for zero input.
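A rough C equivalent of the new byte-count step (illustrative helper, not the
actual asm):
  /* 'mask' has bit 7 set in every byte of the word that was zero */
  static inline unsigned long bytes_before_first_nul(unsigned long mask)
  {
          /* mask is known non-zero here, so tzcount/bsf is well defined */
          return __builtin_ctzl(mask) / 8;
  }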
The 32-bit code continues to use the simple bit op trick that is faster
even on newer cores, but particularly on the older 32-bit-only ones.
Signed-off-by: Linus Torvalds <[email protected]>
|
|
This implements the runtime constant infrastructure for x86, allowing
the dcache d_hash() function to be generated with the hash table address
as a constant, followed by a shift of the hash index by a constant.
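A sketch of the resulting d_hash() form, using the runtime-const helpers
introduced by the series (shown for illustration):
  static inline struct hlist_bl_head *d_hash(unsigned long hashlen)
  {
          return runtime_const_ptr(dentry_hashtable) +
                  runtime_const_shift_right_32(hashlen, d_hash_shift);
  }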
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Streamline the 8-byte case and drop the special handling. Use a macro
which hides the exception handling.
No functional changes.
Signed-off-by: Linus Torvalds <[email protected]>
Signed-off-by: Borislav Petkov (AMD) <[email protected]>
Link: https://lore.kernel.org/r/CAHk-=whYb2L_atsRk9pBiFiVLGe5wNZLHhRinA69yu6FiKvDsw@mail.gmail.com
|
|
Commit
6791e0ea3071 ("x86/resctrl: Access per-rmid structures by index")
adds logic to map individual monitoring groups into a global index space used
for tracking allocated RMIDs.
Attempts to free the default RMID are ignored in free_rmid(), and this works
fine on x86.
With arm64 MPAM, there is a latent bug here however: on platforms with no
monitors exposed through resctrl, each control group still gets a different
monitoring group ID as seen by the hardware, since the CLOSID always forms part
of the monitoring group ID.
This means that when removing a control group, the code may try to free this
group's default monitoring group RMID for real. If there are no monitors
however, the RMID tracking table rmid_ptrs[] would be a waste of memory and is
never allocated, leading to a splat when free_rmid() tries to dereference the
table.
One option would be to treat RMID 0 as special for every CLOSID, but this would
be ugly since bookkeeping still needs to be done for these monitoring group IDs
when there are monitors present in the hardware.
Instead, add a gating check of resctrl_arch_mon_capable() in free_rmid(), and
just do nothing if the hardware doesn't have monitors.
This fix mirrors the gating checks already present in
mkdir_rdt_prepare_rmid_alloc() and elsewhere.
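A minimal sketch of the gate (the surrounding bookkeeping is elided):
  void free_rmid(u32 closid, u32 rmid)
  {
          /* no monitors: rmid_ptrs[] was never allocated, nothing to do */
          if (!resctrl_arch_mon_capable())
                  return;

          /* ... existing RMID bookkeeping ... */
  }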
No functional change on x86.
[ bp: Massage commit message. ]
Fixes: 6791e0ea3071 ("x86/resctrl: Access per-rmid structures by index")
Signed-off-by: Dave Martin <[email protected]>
Signed-off-by: Borislav Petkov (AMD) <[email protected]>
Reviewed-by: Reinette Chatre <[email protected]>
Tested-by: Reinette Chatre <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
|
|
Sync to v6.10-rc3.
Signed-off-by: Jani Nikula <[email protected]>
|
|
Introduce vcpu->wants_to_run to indicate when a vCPU is in its core run
loop, i.e. when the vCPU is running the KVM_RUN ioctl and immediate_exit
was not set.
Replace all references to vcpu->run->immediate_exit with
!vcpu->wants_to_run to avoid TOCTOU races with userspace. For example, a
malicious userspace could invoke KVM_RUN with immediate_exit=true and
then, after KVM reads it to set wants_to_run=false, flip it to false.
This would result in the vCPU running in KVM_RUN with
wants_to_run=false. This wouldn't cause any real bugs today but is a
dangerous landmine.
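A sketch of the race-free form (reading immediate_exit exactly once):
  /* In the KVM_RUN path: snapshot immediate_exit once, then only use the
   * snapshot, so userspace flipping the flag later has no effect. */
  vcpu->wants_to_run = !READ_ONCE(vcpu->run->immediate_exit);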
Signed-off-by: David Matlack <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Sean Christopherson <[email protected]>
|
|
If the BSP vCPU ID was already set, ensure it doesn't get excluded when
limiting vCPU IDs via KVM_CAP_MAX_VCPU_ID.
[mks: provide commit message, code by Sean]
Signed-off-by: Mathias Krause <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Sean Christopherson <[email protected]>
|
|
Do not accept IDs which are definitely invalid by limit checking the
passed value against KVM_MAX_VCPU_IDS and 'max_vcpu_ids' if it was
already set.
This ensures invalid values, especially on 64-bit systems, don't go
unnoticed and lead to a valid id by chance when truncated by the final
assignment.
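A sketch of the bounds check (error handling elided):
  case KVM_CAP_MAX_VCPU_ID:
          r = -EINVAL;
          if (cap->args[0] > KVM_MAX_VCPU_IDS)
                  break;
          /* ... only then record the new limit ... */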
Fixes: 73880c80aa9c ("KVM: Break dependency between vcpu index in vcpus array and vcpu_id.")
Signed-off-by: Mathias Krause <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Co-developed-by: Sean Christopherson <[email protected]>
Signed-off-by: Sean Christopherson <[email protected]>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/efi/efi
Pull EFI fixes from Ard Biesheuvel:
"Another small set of EFI fixes. Only the x86 one is likely to affect
any actual users (and has a cc:stable), but the issue it fixes was
only observed in an unusual context (kexec in a confidential VM).
- Ensure that EFI runtime services are not unmapped by PAN on ARM
- Avoid freeing the memory holding the EFI memory map inadvertently
on x86
- Avoid a false positive kmemleak warning on arm64"
* tag 'efi-fixes-for-v6.10-3' of git://git.kernel.org/pub/scm/linux/kernel/git/efi/efi:
efi/arm64: Fix kmemleak false positive in arm64_efi_rt_init()
efi/x86: Free EFI memory map only when installing a new one.
efi/arm: Disable LPAE PAN when calling EFI runtime services
|
|
To allow execution at a level other than VMPL0, an SVSM must be present.
Allow the SEV-SNP guest to continue booting if an SVSM is detected and
the hypervisor supports the SVSM feature as indicated in the GHCB
hypervisor features bitmap.
[ bp: Massage a bit. ]
Signed-off-by: Tom Lendacky <[email protected]>
Signed-off-by: Borislav Petkov (AMD) <[email protected]>
Link: https://lore.kernel.org/r/2ce7cf281cce1d0cba88f3f576687ef75dc3c953.1717600736.git.thomas.lendacky@amd.com
|
|
When an SVSM is present, the guest can also request attestation reports
from it. These SVSM attestation reports can be used to attest the SVSM
and any services running within the SVSM.
Extend the config-fs attestation support to provide such reports. This involves
creating four new config-fs attributes:
- 'service-provider' (input)
This attribute is used to determine whether the attestation request
should be sent to the specified service provider or to the SEV
firmware. The SVSM service provider is represented by the value
'svsm'.
- 'service_guid' (input)
Used for requesting the attestation of a single service within the
service provider. A null GUID implies that the SVSM_ATTEST_SERVICES
call should be used to request the attestation report. A non-null
GUID implies that the SVSM_ATTEST_SINGLE_SERVICE call should be used.
- 'service_manifest_version' (input)
Used with the SVSM_ATTEST_SINGLE_SERVICE call, the service version
represents a specific service manifest version to be used for the
attestation report.
- 'manifestblob' (output)
Used to return the service manifest associated with the attestation
report.
Only display these new attributes when running under an SVSM.
[ bp: Massage.
- s/svsm_attestation_call/svsm_attest_call/g ]
Signed-off-by: Tom Lendacky <[email protected]>
Signed-off-by: Borislav Petkov (AMD) <[email protected]>
Link: https://lore.kernel.org/r/965015dce3c76bb8724839d50c5dea4e4b5d598f.1717600736.git.thomas.lendacky@amd.com
|
|
Currently, the sev-guest driver uses the vmpck-0 key by default. When an
SVSM is present, the kernel is running at a VMPL other than 0 and the
vmpck-0 key is no longer available. If a specific vmpck key has not been
requested by the user via the vmpck_id module parameter, choose the
vmpck key based on the active VMPL level.
Signed-off-by: Tom Lendacky <[email protected]>
Signed-off-by: Borislav Petkov (AMD) <[email protected]>
Link: https://lore.kernel.org/r/b88081c5d88263176849df8ea93e90a404619cab.1717600736.git.thomas.lendacky@amd.com
|
|
Requesting an attestation report from userspace involves providing the VMPL
level for the report. Currently any value from 0-3 is valid because Linux
enforces running at VMPL0.
When an SVSM is present, though, Linux will not be running at VMPL0 and only
VMPL values from the level Linux is running at up through 3 are valid. In
order to allow userspace to determine the minimum VMPL value that can be
supplied to an attestation report, create a sysfs entry that can be used to
retrieve the current VMPL level of the kernel.
[ bp: Add CONFIG_SYSFS ifdeffery. ]
Signed-off-by: Tom Lendacky <[email protected]>
Signed-off-by: Borislav Petkov (AMD) <[email protected]>
Link: https://lore.kernel.org/r/fff846da0d8d561f9fdaf297dcf8cd907545a25b.1717600736.git.thomas.lendacky@amd.com
|
|
The SVSM specification documents an alternative method of discovery for
the SVSM using a reserved CPUID bit and a reserved MSR. This is intended
for guest components that do not have access to the secrets page in
order to be able to call the SVSM (e.g. UEFI runtime services).
For the MSR support, a new reserved MSR 0xc001f000 has been defined. A #VC
should be generated when accessing this MSR. The #VC handler is expected
to ignore writes to this MSR and return the physical calling area address
(CAA) on reads of this MSR.
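A sketch of the expected #VC handling for that MSR (the helper and the per-CPU
variable name are assumptions):
  #define MSR_SVSM_CAA    0xc001f000

  /* writes are ignored; reads return the physical calling area address */
  static bool handle_svsm_caa_msr(bool write, u64 *val)
  {
          if (!write)
                  *val = this_cpu_read(svsm_caa_pa);
          return true;
  }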
While the CPUID leaf is updated, allowing the creation of a CPU feature,
the code will continue to use the VMPL level as an indication of the
presence of an SVSM. This is because the SVSM can be called well before
the CPU feature is in place and a non-zero VMPL requires that an SVSM be
present.
Signed-off-by: Tom Lendacky <[email protected]>
Signed-off-by: Borislav Petkov (AMD) <[email protected]>
Link: https://lore.kernel.org/r/4f93f10a2ff3e9f368fd64a5920d51bf38d0c19e.1717600736.git.thomas.lendacky@amd.com
|
|
Using the RMPADJUST instruction, the VMSA attribute can only be changed
at VMPL0. An SVSM will be present when running at VMPL1 or a lower
privilege level.
In that case, use the SVSM_CORE_CREATE_VCPU call or the
SVSM_CORE_DESTROY_VCPU call to perform VMSA attribute changes. Use the
VMPL level supplied by the SVSM for the VMSA when starting the AP.
[ bp: Fix typo + touchups. ]
Signed-off-by: Tom Lendacky <[email protected]>
Signed-off-by: Borislav Petkov (AMD) <[email protected]>
Link: https://lore.kernel.org/r/bcdd95ecabe9723673b9693c7f1533a2b8f17781.1717600736.git.thomas.lendacky@amd.com
|
|
The PVALIDATE instruction can only be performed at VMPL0. If an SVSM is
present, it will be running at VMPL0 while the guest itself is then
running at VMPL1 or a lower privilege level.
In that case, use the SVSM_CORE_PVALIDATE call to perform memory
validation instead of issuing the PVALIDATE instruction directly.
The validation of a single 4K page is now explicitly identified as such
in the function name, pvalidate_4k_page(). The pvalidate_pages()
function is used for validating 1 or more pages at either 4K or 2M in
size. Each function, however, determines whether it can issue the
PVALIDATE directly or whether the SVSM needs to be invoked.
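A sketch of the 4K-page decision (helper and variable names follow the series
but are shown here as assumptions):
  static void pvalidate_4k_page(unsigned long vaddr, unsigned long paddr,
                                bool validate)
  {
          if (snp_vmpl)
                  /* an SVSM owns VMPL0: ask it to (in)validate the page */
                  svsm_pval_4k_page(paddr, validate);
          else
                  /* no SVSM: the guest runs at VMPL0 and can PVALIDATE itself */
                  pvalidate(vaddr, RMP_PG_SIZE_4K, validate);
  }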
[ bp: Touchups. ]
[ Tom: fold in a fix for Coconut SVSM:
https://lore.kernel.org/r/[email protected] ]
Signed-off-by: Tom Lendacky <[email protected]>
Signed-off-by: Borislav Petkov (AMD) <[email protected]>
Link: https://lore.kernel.org/r/4c4017d8b94512d565de9ccb555b1a9f8983c69c.1717600736.git.thomas.lendacky@amd.com
|
|
Unknown uncore PMON types can be found in both SPR and EMR with HBM or
CXL.
$ ls /sys/devices/ | grep type
uncore_type_12_16
uncore_type_12_18
uncore_type_12_2
uncore_type_12_4
uncore_type_12_6
uncore_type_12_8
uncore_type_13_17
uncore_type_13_19
uncore_type_13_3
uncore_type_13_5
uncore_type_13_7
uncore_type_13_9
The unknown PMON types are HBM and CXL PMON. Except for the name, the
other information regarding the HBM and CXL PMON counters can be
retrieved via the discovery table. Add them into the uncores tables for
SPR and EMR.
The event config registers for all CXL-related units are 8 bytes apart.
Add SPR_UNCORE_MMIO_OFFS8_COMMON_FORMAT to handle this specially.
Signed-off-by: Kan Liang <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Tested-by: Yunying Sun <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
|
|
The unit control and ID information are retrieved from the unit control
RB tree. No one uses the old structure anymore. Remove them.
Signed-off-by: Kan Liang <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Tested-by: Yunying Sun <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
|
|
The unit control RB tree has the unit control and unit ID information
for all the PCI units. Use them to replace the box_ctls/pci_offsets to
get an accurate unit control address for PCI uncore units.
The UPI/M3UPI units in the discovery table are ignored. Please see the
commit 65248a9a9ee1 ("perf/x86/uncore: Add a quirk for UPI on SPR").
Manually allocate a unit control RB tree for UPI/M3UPI.
Add cleanup_extra_boxes to release such manual allocation.
Signed-off-by: Kan Liang <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Tested-by: Yunying Sun <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
|
|
The unit control RB tree has the unit control and unit ID information
for all the MSR units. Use them to replace the box_ctl and
uncore_msr_box_ctl() to get an accurate unit control address for MSR
uncore units.
Add intel_generic_uncore_assign_hw_event(), which utilizes the accurate
unit control address from the unit control RB tree to calculate the
config_base and event_base.
The unit id related information should be retrieved from the unit
control RB tree as well.
Signed-off-by: Kan Liang <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Tested-by: Yunying Sun <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
|
|
The unit control RB tree has the unit control and unit ID information
for all the units. Use it to replace the box_ctls/mmio_offsets to get
an accurate unit control address for MMIO uncore units.
Signed-off-by: Kan Liang <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Tested-by: Yunying Sun <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
|
|
The box_ids only save the unit IDs for the first die. If a unit, e.g., a
CXL unit, doesn't exist in the first die, its unit ID cannot be retrieved.
The unit control RB tree also stores the unit ID information.
Retrieve the unit ID from the unit control RB tree instead.
Signed-off-by: Kan Liang <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Tested-by: Yunying Sun <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
|
|
The cpumask of some uncore units, e.g., CXL uncore units, may be wrong
under some configurations. Perf may access an uncore counter of a
non-existent uncore unit.
The uncore driver assumes that all uncore units are symmetric among
dies. A global cpumask is shared among all uncore PMUs. However, some
CXL uncore units may only be available on some dies.
A per PMU cpumask is introduced to track the CPU mask of this PMU.
The driver searches the unit control RB tree to check whether the PMU is
available on a given die, and updates the per PMU cpumask accordingly.
Signed-off-by: Kan Liang <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Tested-by: Yunying Sun <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
|
|
The unit control address of some CXL units may be wrongly calculated
under some configurations on an EMR machine.
The current implementation only saves the unit control address of the
units from the first die, and the first unit of the rest of the dies. Perf
assumes that the units on the other dies have the same offset as on the
first die, so the unit control addresses of the remaining units can be
calculated. However, the assumption is wrong, especially for the CXL
units.
Introduce an RB tree for each uncore type to save the unit control
address and three kinds of ID information (unit ID, PMU ID, and die ID)
for all units.
The unit ID is a physical ID of a unit.
The PMU ID is a logical ID assigned to a unit. The logical IDs start
from 0 and must be contiguous. The physical ID and the logical ID are
1:1 mapping. The units with the same physical ID in different dies share
the same PMU.
The die ID indicates which die a unit belongs to.
The RB tree can be searched by two different keys (unit ID or PMU ID +
die ID). During the RB tree setup, the unit ID is used as a key to look
up the RB tree, and perf can create/assign a proper PMU ID to the unit.
Later, after the RB tree is set up, PMU ID + die ID is used as a key to
look up the RB tree to fill the cpumask of a PMU. That lookup is used more
frequently, so PMU ID + die ID is what unit_less() compares.
uncore_find_unit() therefore has to be O(N), but the RB tree setup only
occurs once at driver load time, which is acceptable.
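A sketch of the per-unit node kept in the RB tree (field names are assumptions
based on the description above):
  struct intel_uncore_discovery_unit {
          struct rb_node  node;
          unsigned int    pmu_idx;  /* logical ID: PMU index, contiguous from 0 */
          unsigned int    id;       /* physical unit ID */
          unsigned int    die;      /* die the unit belongs to */
          u64             addr;     /* unit control address */
  };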
Compared with the current implementation, more space is required to save
the information of all units. The extra size should be acceptable.
For example, on EMR, there are 221 units at most. For a 2-socket machine,
the extra space is ~6KB at most.
Signed-off-by: Kan Liang <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
|
|
MADT Multiprocessor Wakeup structure version 1 brings support for CPU offlining:
the BIOS provides a reset vector that the CPU has to jump to in order to offline
itself. The new TEST mailbox command can be used to test whether the CPU offlined
itself, which means the BIOS has control over the CPU and can online it again via
the ACPI MADT wakeup method.
Add CPU offlining support for the ACPI MADT wakeup method by implementing custom
cpu_die(), play_dead() and stop_this_cpu() SMP operations.
CPU offlining makes it possible to hand over secondary CPUs across kexec, so the
second kernel is no longer limited to a single CPU.
The change conforms to the approved ACPI spec change proposal. See the Link.
Signed-off-by: Kirill A. Shutemov <[email protected]>
Signed-off-by: Borislav Petkov (AMD) <[email protected]>
Reviewed-by: Kuppuswamy Sathyanarayanan <[email protected]>
Reviewed-by: Thomas Gleixner <[email protected]>
Acked-by: Kai Huang <[email protected]>
Acked-by: Rafael J. Wysocki <[email protected]>
Tested-by: Tao Liu <[email protected]>
Link: https://lore.kernel.org/all/13356251.uLZWGnKmhe@kreacher
Link: https://lore.kernel.org/r/[email protected]
|
|
The helper complements kernel_ident_mapping_init(): it frees the identity
mapping that was previously allocated. It will be used in the error path to free
a partially allocated mapping or if the mapping is no longer needed.
The caller provides a struct x86_mapping_info with the free_pgd_page() callback
hooked up and the pgd_t to free.
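An illustrative prototype (the exact signature may differ):
  void kernel_ident_mapping_free(struct x86_mapping_info *info, pgd_t *pgd);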
Signed-off-by: Kirill A. Shutemov <[email protected]>
Signed-off-by: Borislav Petkov (AMD) <[email protected]>
Acked-by: Kai Huang <[email protected]>
Tested-by: Tao Liu <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
|
|
If the helper is defined, it is called instead of halt() to stop the CPU at the
end of stop_this_cpu() and on crash CPU shutdown.
ACPI MADT will use it to hand over the CPU to BIOS in order to be able to wake
it up again after kexec.
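A sketch of the call site at the end of stop_this_cpu(); the callback is
assumed to live in smp_ops:
  /* hand the CPU to the firmware instead of just halting it */
  if (smp_ops.stop_this_cpu) {
          smp_ops.stop_this_cpu();
          unreachable();          /* the callback does not return */
  }
  halt();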
Signed-off-by: Kirill A. Shutemov <[email protected]>
Signed-off-by: Borislav Petkov (AMD) <[email protected]>
Reviewed-by: Thomas Gleixner <[email protected]>
Acked-by: Kai Huang <[email protected]>
Tested-by: Tao Liu <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
|
|
ACPI MADT doesn't allow offlining a CPU after it has been onlined. This limits
kexec: the second kernel won't be able to use more than one CPU.
To prevent a kexec kernel from onlining secondary CPUs, invalidate the mailbox
address in the ACPI MADT wakeup structure so that a kexec kernel cannot use
it.
This is safe as the booting kernel has the mailbox address cached already and
acpi_wakeup_cpu() uses the cached value to bring up the secondary CPUs.
Note: This is a Linux specific convention and not covered by the ACPI
specification.
Signed-off-by: Kirill A. Shutemov <[email protected]>
Signed-off-by: Borislav Petkov (AMD) <[email protected]>
Reviewed-by: Kai Huang <[email protected]>
Reviewed-by: Kuppuswamy Sathyanarayanan <[email protected]>
Reviewed-by: Thomas Gleixner <[email protected]>
Acked-by: Rafael J. Wysocki <[email protected]>
Tested-by: Tao Liu <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
|
|
In order to support MADT wakeup structure version 1, provide more appropriate
names for the fields in the structure.
Rename 'mailbox_version' to 'version'. This field signifies the version of the
structure and the related protocols, rather than the version of the mailbox.
This field has not been utilized in the code thus far.
Rename 'base_address' to 'mailbox_address' to clarify the kind of address it
represents. In version 1, the structure includes the reset vector address. Clear
and distinct naming helps to prevent any confusion.
Signed-off-by: Kirill A. Shutemov <[email protected]>
Signed-off-by: Borislav Petkov (AMD) <[email protected]>
Reviewed-by: Kai Huang <[email protected]>
Reviewed-by: Kuppuswamy Sathyanarayanan <[email protected]>
Reviewed-by: Thomas Gleixner <[email protected]>
Acked-by: Rafael J. Wysocki <[email protected]>
Tested-by: Tao Liu <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
|
|
During crashkernel boot only pre-allocated crash memory is presented as
E820_TYPE_RAM.
This can cause page table entries mapping the unaccepted memory table to be
zapped during phys_pte_init(), phys_pmd_init(), phys_pud_init() and
phys_p4d_init(), as SNP/TDX guests use E820_TYPE_ACPI to store the unaccepted
memory table and pass it between the kernels on kexec/kdump.
E820_TYPE_ACPI covers not only ACPI data, but also EFI tables, and might be
required by the kernel to function properly.
The problem was discovered while debugging kdump for an SNP guest. The
unaccepted memory table, stored with E820_TYPE_ACPI and passed between the
kernels on kdump, was getting zapped because the PMD entry mapping it is above
the E820_TYPE_RAM range for the reserved crashkernel memory.
Signed-off-by: Ashish Kalra <[email protected]>
Signed-off-by: Kirill A. Shutemov <[email protected]>
Signed-off-by: Borislav Petkov (AMD) <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
|
|
e820__end_of_ram_pfn() is used to calculate max_pfn which, among other things,
guides where direct mapping ends. Any memory above max_pfn is not going to be
present in the direct mapping.
e820__end_of_ram_pfn() finds the end of the RAM based on the highest
E820_TYPE_RAM range. But it doesn't include E820_TYPE_ACPI ranges in the
calculation.
Despite the name, E820_TYPE_ACPI covers not only ACPI data, but also EFI tables,
and might be required by the kernel to function properly.
Usually the problem is hidden because there is some E820_TYPE_RAM memory above
E820_TYPE_ACPI. But crashkernel only presents pre-allocated crash memory as
E820_TYPE_RAM on boot. If the pre-allocated range is small, it can fit under the
last E820_TYPE_ACPI range.
Modify e820__end_of_ram_pfn() and e820__end_of_low_ram_pfn() to cover
E820_TYPE_ACPI memory.
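A sketch of the widened range filter used by the end-of-RAM scan (the helper
name is hypothetical):
  static bool e820_counts_toward_ram(const struct e820_entry *entry)
  {
          /* E820_TYPE_ACPI may hold ACPI/EFI tables the kernel still needs */
          return entry->type == E820_TYPE_RAM ||
                 entry->type == E820_TYPE_ACPI;
  }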
The problem was discovered during debugging kexec for TDX guest. TDX guest uses
E820_TYPE_ACPI to store the unaccepted memory bitmap and pass it between the
kernels on kexec.
Signed-off-by: Kirill A. Shutemov <[email protected]>
Signed-off-by: Borislav Petkov (AMD) <[email protected]>
Reviewed-by: Dave Hansen <[email protected]>
Tested-by: Tao Liu <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
|
|
TDX guests allocate shared buffers to perform I/O. It is done by allocating
pages normally from the buddy allocator and converting them to shared with
set_memory_decrypted().
The second, kexec-ed kernel has no idea what memory is converted this way. It
only sees E820_TYPE_RAM.
Accessing shared memory via private mapping is fatal. It leads to unrecoverable
TD exit.
On kexec, walk the direct mapping and convert all shared memory back to private.
This makes all RAM private again and the second kernel can use it normally.
The conversion occurs in two steps: stopping new conversions and unsharing all
memory. In the case of normal kexec, the stopping of conversions takes place
while scheduling is still functioning. This allows for waiting until any ongoing
conversions are finished. The second step is carried out when all CPUs except one
are inactive and interrupts are disabled. This prevents any conflicts with code
that may access shared memory.
Signed-off-by: Kirill A. Shutemov <[email protected]>
Signed-off-by: Borislav Petkov (AMD) <[email protected]>
Reviewed-by: Rick Edgecombe <[email protected]>
Reviewed-by: Kai Huang <[email protected]>
Tested-by: Tao Liu <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
|
|
AMD SEV and Intel TDX guests allocate shared buffers for performing I/O.
This is done by allocating pages normally from the buddy allocator and
then converting them to shared using set_memory_decrypted().
On kexec, the second kernel is unaware of which memory has been
converted in this manner. It only sees E820_TYPE_RAM. Accessing shared
memory as private is fatal.
Therefore, the memory state must be reset to its original state before
starting the new kernel with kexec.
The process of converting shared memory back to private occurs in two
steps:
- enc_kexec_begin() stops new conversions.
- enc_kexec_finish() unshares all existing shared memory, reverting it
back to private.
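A sketch of the new hooks as they might appear in struct x86_guest (member
names follow the list above):
  struct x86_guest {
          /* ... existing encryption-status callbacks ... */
          void (*enc_kexec_begin)(void);
          void (*enc_kexec_finish)(void);
  };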
Signed-off-by: Kirill A. Shutemov <[email protected]>
Signed-off-by: Borislav Petkov (AMD) <[email protected]>
Reviewed-by: Nikolay Borisov <[email protected]>
Reviewed-by: Kai Huang <[email protected]>
Tested-by: Tao Liu <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
|
|
The kernel will convert all shared memory back to private during kexec.
The direct mapping page tables will provide information on which memory
is shared.
It is extremely important to convert all shared memory. If a page is
missed, it will cause the second kernel to crash when it accesses it.
Keep track of the number of shared pages. This will allow for
cross-checking against the shared information in the direct mapping and
reporting if the shared bit is lost.
Memory conversion is slow and does not happen often. A global atomic is
not going to be a bottleneck.
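A sketch of the accounting (the counter and helper names are illustrative):
  static atomic_long_t nr_shared;

  static void account_shared_pages(int numpages, bool to_private)
  {
          /* conversions are slow and rare; a global atomic is fine here */
          if (to_private)
                  atomic_long_sub(numpages, &nr_shared);
          else
                  atomic_long_add(numpages, &nr_shared);
  }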
Signed-off-by: Kirill A. Shutemov <[email protected]>
Signed-off-by: Borislav Petkov (AMD) <[email protected]>
Reviewed-by: Kai Huang <[email protected]>
Tested-by: Tao Liu <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
|
|
Currently, lookup_address() returns two things:
1. A "pte_t" (which might be a p[g4um]d_t)
2. The 'level' of the page tables where the "pte_t" was found
(returned via a pointer)
If no pte_t is found, 'level' is essentially garbage.
Always fill out the level. For NULL "pte_t"s, fill in the level where
the p*d_none() entry was found, mirroring the "found" behavior.
Always filling out the level allows using lookup_address() to precisely skip
over holes when walking kernel page tables.
Add one more entry into enum pg_level to indicate the size of the VA
covered by one PGD entry in 5-level paging mode.
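A sketch of the extended enum (the new entry's exact name is an assumption):
  enum pg_level {
          PG_LEVEL_NONE,
          PG_LEVEL_4K,
          PG_LEVEL_2M,
          PG_LEVEL_1G,
          PG_LEVEL_512G,
          PG_LEVEL_256T,  /* VA covered by one PGD entry with 5-level paging */
          PG_LEVEL_NUM
  };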
Update comments for lookup_address() and lookup_address_in_pgd() to
reflect changes in the interface.
Signed-off-by: Kirill A. Shutemov <[email protected]>
Signed-off-by: Borislav Petkov (AMD) <[email protected]>
Reviewed-by: Rick Edgecombe <[email protected]>
Reviewed-by: Baoquan He <[email protected]>
Reviewed-by: Dave Hansen <[email protected]>
Tested-by: Tao Liu <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
|
|
TDX is going to have more than one reason to fail enc_status_change_prepare().
Change the callback to return errno instead of assuming -EIO. Change
enc_status_change_finish() too to keep the interface symmetric.
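A sketch of the adjusted callback signatures (parameter names assumed):
  int (*enc_status_change_prepare)(unsigned long vaddr, int numpages, bool enc);
  int (*enc_status_change_finish)(unsigned long vaddr, int numpages, bool enc);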
Signed-off-by: Kirill A. Shutemov <[email protected]>
Signed-off-by: Borislav Petkov (AMD) <[email protected]>
Reviewed-by: Dave Hansen <[email protected]>
Reviewed-by: Kai Huang <[email protected]>
Reviewed-by: Michael Kelley <[email protected]>
Tested-by: Tao Liu <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
|
|
TDX guests run with MCA enabled (CR4.MCE=1b) from the very start. If
that bit is cleared during CR4 register reprogramming during boot or kexec
flows, a #VE exception will be raised which the guest kernel cannot handle.
Therefore, make sure the CR4.MCE setting is preserved over kexec too and avoid
raising any #VEs.
Signed-off-by: Kirill A. Shutemov <[email protected]>
Signed-off-by: Borislav Petkov (AMD) <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
|
|
That identity_mapped() function was loving that "1" label to the point of
completely confusing its readers.
Use named labels in each place for clarity.
No functional changes.
Signed-off-by: Borislav Petkov (AMD) <[email protected]>
Signed-off-by: Kirill A. Shutemov <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
|
|
ACPI MADT doesn't allow offlining a CPU after it has been woken up.
Currently, CPU hotplug is prevented based on the confidential computing
attribute which is set for Intel TDX. But TDX is not the only possible user of
the wakeup method. Any platform that uses the ACPI MADT wakeup method cannot
offline CPUs.
Disable CPU offlining on ACPI MADT wakeup enumeration.
This has no visible effects for users: currently, TDX guest is the only platform
that uses the ACPI MADT wakeup method.
Signed-off-by: Kirill A. Shutemov <[email protected]>
Signed-off-by: Borislav Petkov (AMD) <[email protected]>
Reviewed-by: Thomas Gleixner <[email protected]>
Acked-by: Rafael J. Wysocki <[email protected]>
Tested-by: Tao Liu <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
|
|
acpi_mp_wake_mailbox_paddr and acpi_mp_wake_mailbox are initialized once during
ACPI MADT init and never changed.
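Given that, the natural annotation is __ro_after_init (sketch; assuming that
is what the patch applies):
  static u64 acpi_mp_wake_mailbox_paddr __ro_after_init;
  static struct acpi_madt_multiproc_wakeup_mailbox *acpi_mp_wake_mailbox __ro_after_init;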
Signed-off-by: Kirill A. Shutemov <[email protected]>
Signed-off-by: Borislav Petkov (AMD) <[email protected]>
Reviewed-by: Baoquan He <[email protected]>
Reviewed-by: Thomas Gleixner <[email protected]>
Acked-by: Kai Huang <[email protected]>
Acked-by: Rafael J. Wysocki <[email protected]>
Tested-by: Tao Liu <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
|
|
In order to prepare for the expansion of support for the ACPI MADT
wakeup method, move the relevant code into a separate file.
Introduce a new configuration option to clearly indicate dependencies
without the use of ifdefs.
There have been no functional changes.
Signed-off-by: Kirill A. Shutemov <[email protected]>
Signed-off-by: Borislav Petkov (AMD) <[email protected]>
Reviewed-by: Baoquan He <[email protected]>
Reviewed-by: Kuppuswamy Sathyanarayanan <[email protected]>
Reviewed-by: Thomas Gleixner <[email protected]>
Acked-by: Borislav Petkov (AMD) <[email protected]>
Acked-by: Kai Huang <[email protected]>
Acked-by: Rafael J. Wysocki <[email protected]>
Tested-by: Tao Liu <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
|
|
This seemingly straightforward JMP was introduced in the initial version
of the 64-bit kexec code without any explanation.
It turns out (check the accompanying Link) it's likely a copy/paste artefact
from 32-bit code, where such a JMP could be used as a serializing
instruction for the 486's prefetch queue. On x86_64 that's not needed
because there's already a preceding write to %cr4, which itself is
a serializing operation.
[ bp: Typos. Let's try this and see what cries out. If it does,
reverting it is trivial. ]
Signed-off-by: Nikolay Borisov <[email protected]>
Signed-off-by: Borislav Petkov (AMD) <[email protected]>
Link: https://lore.kernel.org/all/[email protected]/
|