Age | Commit message (Collapse) | Author | Files | Lines |
|
There is no caller in tree since introduction in commit b4f69df0f65e ("KVM:
x86: Make Hyper-V emulation optional")
Signed-off-by: Yue Haibing <[email protected]>
Message-ID: <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
|
|
x2apic_disable() clears x2apic_state and x2apic_mode unconditionally, even
when the state is X2APIC_ON_LOCKED, which prevents the kernel to disable
it thereby creating inconsistent state.
Due to the early state check for X2APIC_ON, the code path which warns about
a locked X2APIC cannot be reached.
Test for state < X2APIC_ON instead and move the clearing of the state and
mode variables to the place which actually disables X2APIC.
[ tglx: Massaged change log. Added Fixes tag. Moved clearing so it's at the
right place for back ports ]
Fixes: a57e456a7b28 ("x86/apic: Fix fallout from x2apic cleanup")
Signed-off-by: Yuntao Wang <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Cc: [email protected]
Link: https://lore.kernel.org/all/[email protected]
|
|
The copy_from_user() function returns the number of bytes which it
was not able to copy. Return -EFAULT instead.
Fixes: dee5a47cc7a4 ("KVM: SEV: Add KVM_SEV_SNP_LAUNCH_UPDATE command")
Signed-off-by: Dan Carpenter <[email protected]>
Message-ID: <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
|
|
If snp_lookup_rmpentry() fails then "assigned" is printed in the error
message but it was never initialized. Initialize it to false.
Fixes: dee5a47cc7a4 ("KVM: SEV: Add KVM_SEV_SNP_LAUNCH_UPDATE command")
Signed-off-by: Dan Carpenter <[email protected]>
Message-ID: <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
|
|
commit f5a3562ec9dd ("x86/irq: Reserve a per CPU IDT vector for posted
MSIs") changed the first system vector from LOCAL_TIMER_VECTOR to
POSTED_MSI_NOTIFICATION_VECTOR. Reflect this change in the vector layout
comment as well.
However, instead of pointing to the specific vector, use the
FIRST_SYSTEM_VECTOR indirection which essentially refers to the same.
This avoids unnecessary modifications to the same comment whenever
additional system vectors get added.
Signed-off-by: Sohil Mehta <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Reviewed-by: Jacob Pan <[email protected]>
Link: https://lore.kernel.org/all/[email protected]
|
|
The removal of get_physical_broadcast(), generic_bigsmp_probe() and
default_apic_id_valid() left the stale declarations around.
Remove them.
Signed-off-by: Yue Haibing <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Link: https://lore.kernel.org/all/[email protected]
|
|
For any changes of struct fd representation we need to
turn existing accesses to fields into calls of wrappers.
Accesses to struct fd::flags are very few (3 in linux/file.h,
1 in net/socket.c, 3 in fs/overlayfs/file.c and 3 more in
explicit initializers).
Those can be dealt with in the commit converting to
new layout; accesses to struct fd::file are too many for that.
This commit converts (almost) all of f.file to
fd_file(f). It's not entirely mechanical ('file' is used as
a member name more than just in struct fd) and it does not
even attempt to distinguish the uses in pointer context from
those in boolean context; the latter will be eventually turned
into a separate helper (fd_empty()).
NOTE: mass conversion to fd_empty(), tempting as it
might be, is a bad idea; better do that piecewise in commit
that convert from fdget...() to CLASS(...).
[conflicts in fs/fhandle.c, kernel/bpf/syscall.c, mm/memcontrol.c
caught by git; fs/stat.c one got caught by git grep]
[fs/xattr.c conflict]
Reviewed-by: Christian Brauner <[email protected]>
Signed-off-by: Al Viro <[email protected]>
|
|
Structural Based Functional Test at Field (SBAF) is a new type of
testing that provides comprehensive core test coverage complementing
existing IFS tests like Scan at Field (SAF) or ArrayBist.
SBAF device will appear as a new device instance (intel_ifs_2) under
/sys/devices/virtual/misc. The user interaction necessary to load the
test image and test a particular core is the same as the existing scan
test (intel_ifs_0).
During the loading stage, the driver will look for a file named
ff-mm-ss-<batch02x>.sbft in the /lib/firmware/intel/ifs_2 directory.
The hardware interaction needed for loading the image is similar to
SAF, with the only difference being the MSR addresses used. Reuse the
SAF image loading code, passing the SBAF-specific MSR addresses via
struct ifs_test_msrs in the driver device data.
Unlike SAF, the SBAF test image chunks are further divided into smaller
logical entities called bundles. Since the SBAF test is initiated per
bundle, cache the maximum number of bundles in the current image, which
is used for iterating through bundles during SBAF test execution.
Reviewed-by: Ashok Raj <[email protected]>
Reviewed-by: Tony Luck <[email protected]>
Reviewed-by: Ilpo Järvinen <[email protected]>
Signed-off-by: Jithu Joseph <[email protected]>
Co-developed-by: Kuppuswamy Sathyanarayanan <[email protected]>
Signed-off-by: Kuppuswamy Sathyanarayanan <[email protected]>
Link: https://lore.kernel.org/r/20240801051814.1935149-3-sathyanarayanan.kuppuswamy@linux.intel.com
Signed-off-by: Hans de Goede <[email protected]>
|
|
Commit 6fd166aae78c ("x86/mm: Use/Fix PCID to optimize user/kernel
switches") removed the last usage of CR3_HW_ASID_BITS and opted to use
X86_CR3_PCID_BITS instead. Remove CR3_HW_ASID_BITS.
Signed-off-by: Yosry Ahmed <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Link: https://lore.kernel.org/all/[email protected]
|
|
On PREEMPT_RT, kfree() takes sleeping locks and must not be called with
preemption disabled. Therefore, on PREEMPT_RT skcipher_walk_done() must
not be called from within a kernel_fpu_{begin,end}() pair, even when
it's the last call which is guaranteed to not allocate memory.
Therefore, move the last skcipher_walk_done() in gcm_crypt() to the end
of the function so that it goes after the kernel_fpu_end(). To make
this work cleanly, rework the data processing loop to handle only
non-last data segments.
Fixes: b06affb1cb58 ("crypto: x86/aes-gcm - add VAES and AVX512 / AVX10 optimized AES-GCM")
Reported-by: Sebastian Andrzej Siewior <[email protected]>
Closes: https://lore.kernel.org/linux-crypto/[email protected]
Signed-off-by: Eric Biggers <[email protected]>
Tested-by: Sebastian Andrzej Siewior <[email protected]>
Signed-off-by: Herbert Xu <[email protected]>
|
|
Logical destination mode of the local APIC is used for systems with up to
8 CPUs. It has an advantage over physical destination mode as it allows to
target multiple CPUs at once with IPIs.
That advantage was definitely worth it when systems with up to 8 CPUs
were state of the art for servers and workstations, but that's history.
Aside of that there are systems which fail to work with logical destination
mode as the ACPI/DMI quirks show and there are AMD Zen1 systems out there
which fail when interrupt remapping is enabled as reported by Rob and
Christian. The latter problem can be cured by firmware updates, but not all
OEMs distribute the required changes.
Physical destination mode is guaranteed to work because it is the only way
to get a CPU up and running via the INIT/INIT/STARTUP sequence.
As the number of CPUs keeps increasing, logical destination mode becomes a
less used code path so there is no real good reason to keep it around.
Therefore remove logical destination mode support for 64-bit and default to
physical destination mode.
Reported-by: Rob Newcater <[email protected]>
Reported-by: Christian Heusel <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Tested-by: Borislav Petkov (AMD) <[email protected]>
Tested-by: Rob Newcater <[email protected]>
Link: https://lore.kernel.org/all/877cd5u671.ffs@tglx
|
|
Stack unwinding produces large amounts of uninteresting coverage.
It's called from KASAN kmalloc/kfree hooks, fault injection, etc.
It's not particularly useful and is not a function of system call args.
Ignore that code.
Signed-off-by: Dmitry Vyukov <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Reviewed-by: Alexander Potapenko <[email protected]>
Reviewed-by: Marco Elver <[email protected]>
Reviewed-by: Andrey Konovalov <[email protected]>
Acked-by: Peter Zijlstra (Intel) <[email protected]>
Link: https://lore.kernel.org/all/eaf54b8634970b73552dcd38bf9be6ef55238c10.1718092070.git.dvyukov@google.com
|
|
common_interrupt() and related variants call kvm_set_cpu_l1tf_flush_l1d(),
which is neither marked noinstr nor __always_inline.
So compiler puts it out of line and adds instrumentation to it. Since the
call is inside of instrumentation_begin/end(), objtool does not warn about
it.
The manifestation is that KCOV produces spurious coverage in
kvm_set_cpu_l1tf_flush_l1d() in random places because the call happens when
preempt count is not yet updated to say that the kernel is in an interrupt.
Mark kvm_set_cpu_l1tf_flush_l1d() as __always_inline and move it out of the
instrumentation_begin/end() section. It only calls __this_cpu_write()
which is already safe to call in noinstr contexts.
Fixes: 6368558c3710 ("x86/entry: Provide IDTENTRY_SYSVEC")
Signed-off-by: Dmitry Vyukov <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Reviewed-by: Alexander Potapenko <[email protected]>
Acked-by: Peter Zijlstra (Intel) <[email protected]>
Cc: [email protected]
Link: https://lore.kernel.org/all/3f9a1de9e415fcb53d07dc9e19fa8481bb021b1b.1718092070.git.dvyukov@google.com
|
|
This per CPU log is becoming longer with more and more CPUs in system,
which slows down the boot process due to the serializing nature of
printk().
The value of this information is dubious and it can be retrieved by lscpu
from user space if required..
Downgrade the printk() to pr_debug() so it is still accessible for debug
purposes.
[ tglx: Massaged changelog ]
Signed-off-by: Li RongQing <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Link: https://lore.kernel.org/all/[email protected]
|
|
MTRRs have an obsolete fixed variant for fine grained caching control
of the 640K-1MB region that uses separate MSRs. This fixed variant has
a separate capability bit in the MTRR capability MSR.
So far all x86 CPUs which support MTRR have this separate bit set, so it
went unnoticed that mtrr_save_state() does not check the capability bit
before accessing the fixed MTRR MSRs.
Though on a CPU that does not support the fixed MTRR capability this
results in a #GP. The #GP itself is harmless because the RDMSR fault is
handled gracefully, but results in a WARN_ON().
Add the missing capability check to prevent this.
Fixes: 2b1f6278d77c ("[PATCH] x86: Save the MTRRs of the BSP before booting an AP")
Signed-off-by: Andi Kleen <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Cc: [email protected]
Link: https://lore.kernel.org/all/[email protected]
|
|
The kernel can change spinlock behavior when running as a guest. But this
guest-friendly behavior causes performance problems on bare metal.
The kernel uses a static key to switch between the two modes.
In theory, the static key is enabled by default (run in guest mode) and
should be disabled for bare metal (and in some guests that want native
behavior or paravirt spinlock).
A performance drop is reported when running encode/decode workload and
BenchSEE cache sub-workload.
Bisect points to commit ce0a1b608bfc ("x86/paravirt: Silence unused
native_pv_lock_init() function warning"). When CONFIG_PARAVIRT_SPINLOCKS is
disabled the virt_spin_lock_key is incorrectly set to true on bare
metal. The qspinlock degenerates to test-and-set spinlock, which decreases
the performance on bare metal.
Set the default value of virt_spin_lock_key to false. If booting in a VM,
enable this key. Later during the VM initialization, if other
high-efficient spinlock is preferred (e.g. paravirt-spinlock), or the user
wants the native qspinlock (via nopvspin boot commandline), the
virt_spin_lock_key is disabled accordingly.
This results in the following decision matrix:
X86_FEATURE_HYPERVISOR Y Y Y N
CONFIG_PARAVIRT_SPINLOCKS Y Y N Y/N
PV spinlock Y N N Y/N
virt_spin_lock_key N Y/N Y N
Fixes: ce0a1b608bfc ("x86/paravirt: Silence unused native_pv_lock_init() function warning")
Reported-by: Prem Nath Dey <[email protected]>
Reported-by: Xiaoping Zhou <[email protected]>
Suggested-by: Dave Hansen <[email protected]>
Suggested-by: Qiuxu Zhuo <[email protected]>
Suggested-by: Nikolay Borisov <[email protected]>
Signed-off-by: Chen Yu <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Reviewed-by: Nikolay Borisov <[email protected]>
Cc: [email protected]
Link: https://lore.kernel.org/all/[email protected]
|
|
On a platform using the "Multiprocessor Wakeup Structure"[1] to startup
secondary CPUs the control processor needs to memremap() the physical
address of the MP Wakeup Structure mailbox to the variable
acpi_mp_wake_mailbox, which holds the virtual address of mailbox.
To wake up the AP the control processor writes the APIC ID of AP, the
wakeup vector and the ACPI_MP_WAKE_COMMAND_WAKEUP command into the mailbox.
Current implementation doesn't consider the case which restricts boot time
CPU bringup to 1 with the kernel parameter "maxcpus=1" and brings other
CPUs online later from user space as it sets acpi_mp_wake_mailbox to
read-only after init. So when the first AP is tried to brought online
after init, the attempt to update the variable results in a kernel panic.
The memremap() call that initializes the variable cannot be moved into
acpi_parse_mp_wake() because memremap() is not functional at that point in
the boot process. Also as the APs might never be brought up, keep the
memremap() call in acpi_wakeup_cpu() so that the operation only takes place
when needed.
Fixes: 24dd05da8c79 ("x86/apic: Mark acpi_mp_wake_* variables as __ro_after_init")
Signed-off-by: Zhiquan Li <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Reviewed-by: Kirill A. Shutemov <[email protected]>
Link: https://lore.kernel.org/all/[email protected]
|
|
Commit 2744a7ce34a7 ("x86/apic: Replace acpi_wake_cpu_handler_update() and
apic_set_eoi_cb()") left it unused.
Signed-off-by: Yue Haibing <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Link: https://lore.kernel.org/all/[email protected]
|
|
Add missing new lines and reorder variable definitions.
Signed-off-by: Thomas Gleixner <[email protected]>
Tested-by: Qiuxu Zhuo <[email protected]>
Tested-by: Breno Leitao <[email protected]>
Link: https://lore.kernel.org/all/[email protected]
|
|
80 character limit is history.
Signed-off-by: Thomas Gleixner <[email protected]>
Tested-by: Qiuxu Zhuo <[email protected]>
Tested-by: Breno Leitao <[email protected]>
Link: https://lore.kernel.org/all/[email protected]
|
|
Add brackets around if/for constructs as required by coding style or remove
pointless line breaks to make it true single line statements which do not
require brackets.
Signed-off-by: Thomas Gleixner <[email protected]>
Tested-by: Qiuxu Zhuo <[email protected]>
Tested-by: Breno Leitao <[email protected]>
Link: https://lore.kernel.org/all/[email protected]
|
|
Use proper comment styles and shrink comments to their scope where
applicable.
Signed-off-by: Thomas Gleixner <[email protected]>
Tested-by: Qiuxu Zhuo <[email protected]>
Tested-by: Breno Leitao <[email protected]>
Link: https://lore.kernel.org/all/[email protected]
|
|
It's only used by check_timer().
Signed-off-by: Thomas Gleixner <[email protected]>
Tested-by: Qiuxu Zhuo <[email protected]>
Tested-by: Breno Leitao <[email protected]>
Link: https://lore.kernel.org/all/[email protected]
|
|
Use the new apic_pr_verbose() helper.
Signed-off-by: Thomas Gleixner <[email protected]>
Tested-by: Qiuxu Zhuo <[email protected]>
Tested-by: Breno Leitao <[email protected]>
Link: https://lore.kernel.org/all/[email protected]
|
|
Cleanup the APIC printk()s which are inside of a apic verbosity guarded
region by using apic_dbg() for the KERN_DEBUG level prints.
Signed-off-by: Thomas Gleixner <[email protected]>
Tested-by: Qiuxu Zhuo <[email protected]>
Tested-by: Breno Leitao <[email protected]>
Link: https://lore.kernel.org/all/[email protected]
|
|
Replace apic_printk($LEVEL) with the corresponding apic_pr_*() helpers and
use pr_info() for APIC_QUIET as that is always printed so the indirection
is pointless noise.
Signed-off-by: Thomas Gleixner <[email protected]>
Tested-by: Qiuxu Zhuo <[email protected]>
Tested-by: Breno Leitao <[email protected]>
Link: https://lore.kernel.org/all/[email protected]
|
|
Use the new apic_pr_*() helpers and cleanup the apic_printk() maze.
Signed-off-by: Thomas Gleixner <[email protected]>
Tested-by: Qiuxu Zhuo <[email protected]>
Tested-by: Breno Leitao <[email protected]>
Link: https://lore.kernel.org/all/[email protected]
|
|
apic_printk() requires the APIC verbosity level and printk level which is
tedious and horrible to read. Provide helpers to simplify all of that.
Signed-off-by: Thomas Gleixner <[email protected]>
Tested-by: Qiuxu Zhuo <[email protected]>
Tested-by: Breno Leitao <[email protected]>
Link: https://lore.kernel.org/all/[email protected]
|
|
KISS rules!
Signed-off-by: Thomas Gleixner <[email protected]>
Tested-by: Qiuxu Zhuo <[email protected]>
Tested-by: Breno Leitao <[email protected]>
Link: https://lore.kernel.org/all/[email protected]
|
|
Make them conforming to the TIP coding style guide.
Signed-off-by: Thomas Gleixner <[email protected]>
Tested-by: Qiuxu Zhuo <[email protected]>
Tested-by: Breno Leitao <[email protected]>
Link: https://lore.kernel.org/all/[email protected]
|
|
Only invoked from check_timer() which is __init too. Cleanup the variable
declaration while at it.
Signed-off-by: Thomas Gleixner <[email protected]>
Tested-by: Qiuxu Zhuo <[email protected]>
Tested-by: Breno Leitao <[email protected]>
Reviewed-by: Breno Leitao <[email protected]>
Link: https://lore.kernel.org/all/[email protected]
|
|
Breno observed panics when using failslab under certain conditions during
runtime:
can not alloc irq_pin_list (-1,0,20)
Kernel panic - not syncing: IO-APIC: failed to add irq-pin. Can not proceed
panic+0x4e9/0x590
mp_irqdomain_alloc+0x9ab/0xa80
irq_domain_alloc_irqs_locked+0x25d/0x8d0
__irq_domain_alloc_irqs+0x80/0x110
mp_map_pin_to_irq+0x645/0x890
acpi_register_gsi_ioapic+0xe6/0x150
hpet_open+0x313/0x480
That's a pointless panic which is a leftover of the historic IO/APIC code
which panic'ed during early boot when the interrupt allocation failed.
The only place which might justify panic is the PIT/HPET timer_check() code
which tries to figure out whether the timer interrupt is delivered through
the IO/APIC. But that code does not require to handle interrupt allocation
failures. If the interrupt cannot be allocated then timer delivery fails
and it either panics due to that or falls back to legacy mode.
Cure this by removing the panic wrapper around __add_pin_to_irq_node() and
making mp_irqdomain_alloc() aware of the failure condition and handle it as
any other failure in this function gracefully.
Reported-by: Breno Leitao <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Tested-by: Breno Leitao <[email protected]>
Tested-by: Qiuxu Zhuo <[email protected]>
Link: https://lore.kernel.org/all/[email protected]
Link: https://lore.kernel.org/all/[email protected]
|
|
So it turns out that we have to do two passes of
pti_clone_entry_text(), once before initcalls, such that device and
late initcalls can use user-mode-helper / modprobe and once after
free_initmem() / mark_readonly().
Now obviously mark_readonly() can cause PMD splits, and
pti_clone_pgtable() doesn't like that much.
Allow the late clone to split PMDs so that pagetables stay in sync.
[peterz: Changelog and comments]
Reported-by: Guenter Roeck <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Tested-by: Guenter Roeck <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]
|
|
Currently ARM64 extracts which specific sanitizer has caused a trap via
encoded data in the trap instruction. Clang on x86 currently encodes the
same data in the UD1 instruction but x86 handle_bug() and
is_valid_bugaddr() currently only look at UD2.
Bring x86 to parity with ARM64, similar to commit 25b84002afb9 ("arm64:
Support Clang UBSAN trap codes for better reporting"). See the llvm
links for information about the code generation.
Enable the reporting of UBSAN sanitizer details on x86 compiled with clang
when CONFIG_UBSAN_TRAP=y by analysing UD1 and retrieving the type immediate
which is encoded by the compiler after the UD1.
[ tglx: Simplified it by moving the printk() into handle_bug() ]
Signed-off-by: Gatlin Newhouse <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Acked-by: Peter Zijlstra (Intel) <[email protected]>
Cc: Kees Cook <[email protected]>
Link: https://lore.kernel.org/all/[email protected]
Link: https://github.com/llvm/llvm-project/commit/c5978f42ec8e9#diff-bb68d7cd885f41cfc35843998b0f9f534adb60b415f647109e597ce448e92d9f
Link: https://github.com/llvm/llvm-project/blob/main/llvm/lib/Target/X86/X86InstrSystem.td#L27
|
|
The default paranoid setting was updated in commit 0161028b7c8a
("perf/core: Change the default paranoia level to 2") so this comment is
no longer true.
Signed-off-by: James Clark <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]
|
|
Some uncore PMON registers are located in the MMIO space of the Host
Bridge and DRAM Controller device, which is located at D0:F0 for
Tiger Lake and later client generation.
Use D0:F0 as a default device. So it doesn't need to keep adding the
complete Device ID list for each generation anymore.
Signed-off-by: Zhenyu Wang <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Reviewed-by: Kan Liang <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
|
|
LNL uncore imc freerunning counters keep same as previous HW.
Signed-off-by: Zhenyu Wang <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Reviewed-by: Kan Liang <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
|
|
The uncore subsystem for Lunar Lake is similar to the previous
Meteor Lake. The uncore PerfMon registers are located at both
MSR and MMIO space.
The ARB and iMC are kept. There is no difference from the Meteor Lake.
Move the global control initialization to the first box of the CBOX.
The sNCU is moved to the MMIO space.
The HBO is newly added and only be accessed from the MMIO space.
Signed-off-by: Kan Liang <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
|
|
Some uncore PMON registers are located in the MMIO space. For the client
machine, the MMIO space is usually located at D0:F0 but in a different
BAR. For example, some uncore PMON registers are located in the SAF BAR,
not the MCHBAR in the Lunar Lake.
The current __uncore_imc_init_box() hard code the BAR information.
Factor out the uncore_get_box_mmio_addr() which uses the BAR information
as a parameter.
The only change is the error output message. The hardcode name 'MCHBAR'
is replaced by the offset of a BAR.
Add a new macro, MMIO_UNCORE_COMMON_OPS(), since the MMIO ops functions
are usually the same among different generations.
Signed-off-by: Kan Liang <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
|
|
>From the perspective of the uncore PMU, the Arrow Lake is the same as
the previous Meteor Lake. The only difference is the event list, which
will be supported in the perf tool later.
Signed-off-by: Kan Liang <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
|
|
When ident_pud_init() uses only GB pages to create identity maps, large
ranges of addresses not actually requested can be included in the resulting
table; a 4K request will map a full GB. This can include a lot of extra
address space past that requested, including areas marked reserved by the
BIOS. That allows processor speculation into reserved regions, that on UV
systems can cause system halts.
Only use GB pages when map creation requests include the full GB page of
space. Fall back to using smaller 2M pages when only portions of a GB page
are included in the request.
No attempt is made to coalesce mapping requests. If a request requires a
map entry at the 2M (pmd) level, subsequent mapping requests within the
same 1G region will also be at the pmd level, even if adjacent or
overlapping such requests could have been combined to map a full GB page.
Existing usage starts with larger regions and then adds smaller regions, so
this should not have any great consequence.
Signed-off-by: Steve Wahl <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Tested-by: Pavin Joseph <[email protected]>
Tested-by: Sarah Brofeldt <[email protected]>
Tested-by: Eric Hagberg <[email protected]>
Link: https://lore.kernel.org/all/[email protected]
|
|
A kexec kernel boot failure is sometimes observed on AMD CPUs due to an
unmapped EFI config table array. This can be seen when "nogbpages" is on
the kernel command line, and has been observed as a full BIOS reboot rather
than a successful kexec.
This was also the cause of reported regressions attributed to Commit
7143c5f4cf20 ("x86/mm/ident_map: Use gbpages only where full GB page should
be mapped.") which was subsequently reverted.
To avoid this page fault, explicitly include the EFI config table array in
the kexec identity map.
Further explanation:
The following 2 commits caused the EFI config table array to be
accessed when enabling sev at kernel startup.
commit ec1c66af3a30 ("x86/compressed/64: Detect/setup SEV/SME features
earlier during boot")
commit c01fce9cef84 ("x86/compressed: Add SEV-SNP feature
detection/setup")
This is in the code that examines whether SEV should be enabled or not, so
it can even affect systems that are not SEV capable.
This may result in a page fault if the EFI config table array's address is
unmapped. Since the page fault occurs before the new kernel establishes its
own identity map and page fault routines, it is unrecoverable and kexec
fails.
Most often, this problem is not seen because the EFI config table array
gets included in the map by the luck of being placed at a memory address
close enough to other memory areas that *are* included in the map created
by kexec.
Both the "nogbpages" command line option and the "use gpbages only where
full GB page should be mapped" change greatly reduce the chance of being
included in the map by luck, which is why the problem appears.
Signed-off-by: Tao Liu <[email protected]>
Signed-off-by: Steve Wahl <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Tested-by: Pavin Joseph <[email protected]>
Tested-by: Sarah Brofeldt <[email protected]>
Tested-by: Eric Hagberg <[email protected]>
Reviewed-by: Ard Biesheuvel <[email protected]>
Link: https://lore.kernel.org/all/[email protected]
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 fixes from Thomas Gleixner:
- Prevent a deadlock on cpu_hotplug_lock in the aperf/mperf driver.
A recent change in the ACPI code which consolidated code pathes moved
the invocation of init_freq_invariance_cppc() to be moved to a CPU
hotplug handler. The first invocation on AMD CPUs ends up enabling a
static branch which dead locks because the static branch enable tries
to acquire cpu_hotplug_lock but that lock is already held write by
the hotplug machinery.
Use static_branch_enable_cpuslocked() instead and take the hotplug
lock read for the Intel code path which is invoked from the
architecture code outside of the CPU hotplug operations.
- Fix the number of reserved bits in the sev_config structure bit field
so that the bitfield does not exceed 64 bit.
- Add missing Zen5 model numbers
- Fix the alignment assumptions of pti_clone_pgtable() and
clone_entry_text() on 32-bit:
The code assumes PMD aligned code sections, but on 32-bit the kernel
entry text is not PMD aligned. So depending on the code size and
location, which is configuration and compiler dependent, entry text
can cross a PMD boundary. As the start is not PMD aligned adding PMD
size to the start address is larger than the end address which
results in partially mapped entry code for user space. That causes
endless recursion on the first entry from userspace (usually #PF).
Cure this by aligning the start address in the addition so it ends up
at the next PMD start address.
clone_entry_text() enforces PMD mapping, but on 32-bit the tail might
eventually be PTE mapped, which causes a map fail because the PMD for
the tail is not a large page mapping. Use PTI_LEVEL_KERNEL_IMAGE for
the clone() invocation which resolves to PTE on 32-bit and PMD on
64-bit.
- Zero the 8-byte case for get_user() on range check failure on 32-bit
The recend consolidation of the 8-byte get_user() case broke the
zeroing in the failure case again. Establish it by clearing ECX
before the range check and not afterwards as that obvioulsy can't be
reached when the range check fails
* tag 'x86-urgent-2024-08-04' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/uaccess: Zero the 8-byte get_range case on failure on 32-bit
x86/mm: Fix pti_clone_entry_text() for i386
x86/mm: Fix pti_clone_pgtable() alignment assumption
x86/setup: Parse the builtin command line before merging
x86/CPU/AMD: Add models 0x60-0x6f to the Zen5 range
x86/sev: Fix __reserved field in sev_config
x86/aperfmperf: Fix deadlock on cpu_hotplug_lock
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 perf fixes from Thomas Gleixner:
- Move the smp_processor_id() invocation back into the non-preemtible
region, so that the result is valid to use
- Add the missing package C2 residency counters for Sierra Forest CPUs
to make the newly added support actually useful
* tag 'perf-urgent-2024-08-04' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
perf/x86: Fix smp_processor_id()-in-preemptible warnings
perf/x86/intel/cstate: Add pkg C2 residency counter for Sierra Forest
|
|
A Linux guest on Hyper-V gets the TSC frequency from a synthetic MSR, if
available. In this case, set X86_FEATURE_TSC_KNOWN_FREQ so that Linux
doesn't unnecessarily do refined TSC calibration when setting up the TSC
clocksource.
With this change, a message such as this is no longer output during boot
when the TSC is used as the clocksource:
[ 1.115141] tsc: Refined TSC clocksource calibration: 2918.408 MHz
Furthermore, the guest and host will have exactly the same view of the
TSC frequency, which is important for features such as the TSC deadline
timer that are emulated by the Hyper-V host.
Signed-off-by: Michael Kelley <[email protected]>
Reviewed-by: Roman Kisel <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Wei Liu <[email protected]>
Message-ID: <[email protected]>
|
|
Pull kvm updates from Paolo Bonzini:
"The bulk of the changes here is a largish change to guest_memfd,
delaying the clearing and encryption of guest-private pages until they
are actually added to guest page tables. This started as "let's make
it impossible to misuse the API" for SEV-SNP; but then it ballooned a
bit.
The new logic is generally simpler and more ready for hugepage support
in guest_memfd.
Summary:
- fix latent bug in how usage of large pages is determined for
confidential VMs
- fix "underline too short" in docs
- eliminate log spam from limited APIC timer periods
- disallow pre-faulting of memory before SEV-SNP VMs are initialized
- delay clearing and encrypting private memory until it is added to
guest page tables
- this change also enables another small cleanup: the checks in
SNP_LAUNCH_UPDATE that limit it to non-populated, private pages can
now be moved in the common kvm_gmem_populate() function
- fix compilation error that the RISC-V merge introduced in selftests"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
KVM: x86/mmu: fix determination of max NPT mapping level for private pages
KVM: riscv: selftests: Fix compile error
KVM: guest_memfd: abstract how prepared folios are recorded
KVM: guest_memfd: let kvm_gmem_populate() operate only on private gfns
KVM: extend kvm_range_has_memory_attributes() to check subset of attributes
KVM: cleanup and add shortcuts to kvm_range_has_memory_attributes()
KVM: guest_memfd: move check for already-populated page to common code
KVM: remove kvm_arch_gmem_prepare_needed()
KVM: guest_memfd: make kvm_gmem_prepare_folio() operate on a single struct kvm
KVM: guest_memfd: delay kvm_gmem_prepare_folio() until the memory is passed to the guest
KVM: guest_memfd: return locked folio from __kvm_gmem_get_pfn
KVM: rename CONFIG_HAVE_KVM_GMEM_* to CONFIG_HAVE_KVM_ARCH_GMEM_*
KVM: guest_memfd: do not go through struct page
KVM: guest_memfd: delay folio_mark_uptodate() until after successful preparation
KVM: guest_memfd: return folio from __kvm_gmem_get_pfn()
KVM: x86: disallow pre-fault for SNP VMs before initialization
KVM: Documentation: Fix title underline too short warning
KVM: x86: Eliminate log spam from limited APIC timer periods
|
|
The unsynchronized_tsc() eventually checks num_possible_cpus(), and if the
system is non-Intel and the number of possible CPUs is greater than one,
assumes that TSCs are unsynchronized. This despite the comment saying
"assume multi socket systems are not synchronized", that is, socket rather
than CPU. This behavior was preserved by commit 8fbbc4b45ce3 ("x86: merge
tsc_init and clocksource code") and by the previous relevant commit
7e69f2b1ead2 ("clocksource: Remove the update callback").
The clocksource drivers were added by commit 5d0cf410e94b ("Time: i386
Clocksource Drivers") back in 2006, and the comment still said "socket"
rather than "CPU".
Therefore, bravely (and perhaps foolishly) make the code match the
comment.
Note that it is possible to bypass both code and comment by booting
with tsc=reliable, but this also disables the clocksource watchdog,
which is undesirable when trust in the TSC is strictly limited.
Reported-by: Zhengxu Chen <[email protected]>
Reported-by: Danielle Costantino <[email protected]>
Signed-off-by: Paul E. McKenney <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Link: https://lore.kernel.org/all/[email protected]
|
|
* fix latent bug in how usage of large pages is determined for
confidential VMs
* fix "underline too short" in docs
* eliminate log spam from limited APIC timer periods
* disallow pre-faulting of memory before SEV-SNP VMs are initialized
* delay clearing and encrypting private memory until it is added to
guest page tables
* this change also enables another small cleanup: the checks in
SNP_LAUNCH_UPDATE that limit it to non-populated, private pages
can now be moved in the common kvm_gmem_populate() function
|
|
According to the data sheet, writing the MODE register should stop the
counter (and thus the interrupts). This appears to work on real hardware,
at least modern Intel and AMD systems. It should also work on Hyper-V.
However, on some buggy virtual machines the mode change doesn't have any
effect until the counter is subsequently loaded (or perhaps when the IRQ
next fires).
So, set MODE 0 and then load the counter, to ensure that those buggy VMs
do the right thing and the interrupts stop. And then write MODE 0 *again*
to stop the counter on compliant implementations too.
Apparently, Hyper-V keeps firing the IRQ *repeatedly* even in mode zero
when it should only happen once, but the second MODE write stops that too.
Userspace test program (mostly written by tglx):
=====
#include <stdio.h>
#include <unistd.h>
#include <stdlib.h>
#include <stdint.h>
#include <sys/io.h>
static __always_inline void __out##bwl(type value, uint16_t port) \
{ \
asm volatile("out" #bwl " %" #bw "0, %w1" \
: : "a"(value), "Nd"(port)); \
} \
\
static __always_inline type __in##bwl(uint16_t port) \
{ \
type value; \
asm volatile("in" #bwl " %w1, %" #bw "0" \
: "=a"(value) : "Nd"(port)); \
return value; \
}
BUILDIO(b, b, uint8_t)
#define inb __inb
#define outb __outb
#define PIT_MODE 0x43
#define PIT_CH0 0x40
#define PIT_CH2 0x42
static int is8254;
static void dump_pit(void)
{
if (is8254) {
// Latch and output counter and status
outb(0xC2, PIT_MODE);
printf("%02x %02x %02x\n", inb(PIT_CH0), inb(PIT_CH0), inb(PIT_CH0));
} else {
// Latch and output counter
outb(0x0, PIT_MODE);
printf("%02x %02x\n", inb(PIT_CH0), inb(PIT_CH0));
}
}
int main(int argc, char* argv[])
{
int nr_counts = 2;
if (argc > 1)
nr_counts = atoi(argv[1]);
if (argc > 2)
is8254 = 1;
if (ioperm(0x40, 4, 1) != 0)
return 1;
dump_pit();
printf("Set oneshot\n");
outb(0x38, PIT_MODE);
outb(0x00, PIT_CH0);
outb(0x0F, PIT_CH0);
dump_pit();
usleep(1000);
dump_pit();
printf("Set periodic\n");
outb(0x34, PIT_MODE);
outb(0x00, PIT_CH0);
outb(0x0F, PIT_CH0);
dump_pit();
usleep(1000);
dump_pit();
dump_pit();
usleep(100000);
dump_pit();
usleep(100000);
dump_pit();
printf("Set stop (%d counter writes)\n", nr_counts);
outb(0x30, PIT_MODE);
while (nr_counts--)
outb(0xFF, PIT_CH0);
dump_pit();
usleep(100000);
dump_pit();
usleep(100000);
dump_pit();
printf("Set MODE 0\n");
outb(0x30, PIT_MODE);
dump_pit();
usleep(100000);
dump_pit();
usleep(100000);
dump_pit();
return 0;
}
=====
Suggested-by: Sean Christopherson <[email protected]>
Co-developed-by: Li RongQing <[email protected]>
Signed-off-by: Li RongQing <[email protected]>
Signed-off-by: David Woodhouse <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Tested-by: Michael Kelley <[email protected]>
Link: https://lore.kernel.org/all/[email protected]
|
|
Leaving the PIT interrupt running can cause noticeable steal time for
virtual guests. The VMM generally has a timer which toggles the IRQ input
to the PIC and I/O APIC, which takes CPU time away from the guest. Even
on real hardware, running the counter may use power needlessly (albeit
not much).
Make sure it's turned off if it isn't going to be used.
Signed-off-by: David Woodhouse <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Tested-by: Michael Kelley <[email protected]>
Link: https://lore.kernel.org/all/[email protected]
|