Cross-merge bpf fixes after downstream PR including
important fixes (from bpf-next point of view):
commit 41c24102af7b ("selftests/bpf: Filter out _GNU_SOURCE when compiling test_cpp")
commit fdad456cbcca ("bpf: Fix updating attached freplace prog in prog_array map")
No conflicts.
Adjacent changes in:
include/linux/bpf_verifier.h
kernel/bpf/verifier.c
tools/testing/selftests/bpf/Makefile
Link: https://lore.kernel.org/bpf/[email protected]/
Signed-off-by: Alexei Starovoitov <[email protected]>
|
|
If pcie_find_root_port() is unable to locate a Root Port, it will return
NULL. Check the pointer for NULL before dereferencing it.
This particular case is in a quirk for devices that are always below a Root
Port, so the check won't avoid a real problem and doesn't need to be
backported, but add it as a matter of style and to prevent copy/paste mistakes.
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Samasth Norway Ananda <[email protected]>
[bhelgaas: drop Fixes: and explain why there's no problem in this case]
Signed-off-by: Bjorn Helgaas <[email protected]>
Reviewed-by: Mario Limonciello <[email protected]>
|
|
iounmap() on x86 occasionally fails to unmap because the provided valid
ioremap address is not below high_memory. It turned out that this
happens due to KASLR.
KASLR uses the full address space between PAGE_OFFSET and vaddr_end to
randomize the starting points of the direct map, vmalloc and vmemmap
regions. It thereby limits the size of the direct map by using the
installed memory size plus an extra configurable margin for hot-plug
memory. This limitation is done to gain more randomization space
because otherwise only the holes between the direct map, vmalloc,
vmemmap and vaddr_end would be usable for randomizing.
The limited direct map size is not exposed to the rest of the kernel, so
the memory hot-plug and resource management related code paths still
operate under the assumption that the available address space can be
determined with MAX_PHYSMEM_BITS.
request_free_mem_region() allocates from (1 << MAX_PHYSMEM_BITS) - 1
downwards. That means the first allocation happens past the end of the
direct map and if unlucky this address is in the vmalloc space, which
causes high_memory to become greater than VMALLOC_START and consequently
causes iounmap() to fail for valid ioremap addresses.
MAX_PHYSMEM_BITS cannot be changed for that because the randomization
does not align with address bit boundaries and there are other places
which actually need to know the maximum number of address bits. All
remaining usage sites of MAX_PHYSMEM_BITS have been analyzed and found
to be correct.
Cure this by exposing the end of the direct map via PHYSMEM_END and use
that for the memory hot-plug and resource management related places
instead of relying on MAX_PHYSMEM_BITS. In the KASLR case PHYSMEM_END
maps to a variable which is initialized by the KASLR initialization and
otherwise it is based on MAX_PHYSMEM_BITS as before.
To prevent future hiccups, add a check to add_pages() to catch callers
trying to add memory above PHYSMEM_END.
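For illustration, the exposure can be thought of along these lines; a minimal
sketch, assuming a physmem_end variable filled in by the KASLR setup (not the
literal upstream diff):
    /* With KASLR the end of the direct map is a variable set up during
     * KASLR initialization; without KASLR, fall back to the old
     * MAX_PHYSMEM_BITS based limit. */
    #ifdef CONFIG_RANDOMIZE_MEMORY
    extern unsigned long physmem_end;
    # define PHYSMEM_END	physmem_end
    #else
    # define PHYSMEM_END	((1ULL << MAX_PHYSMEM_BITS) - 1)
    #endif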
Fixes: 0483e1fa6e09 ("x86/mm: Implement ASLR for kernel memory regions")
Reported-by: Max Ramanouski <[email protected]>
Reported-by: Alistair Popple <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Tested-By: Max Ramanouski <[email protected]>
Tested-by: Alistair Popple <[email protected]>
Reviewed-by: Dan Williams <[email protected]>
Reviewed-by: Alistair Popple <[email protected]>
Reviewed-by: Kees Cook <[email protected]>
Cc: [email protected]
Link: https://lore.kernel.org/all/87ed6soy3z.ffs@tglx
|
|
In any normal situation this really shouldn't matter, but in case the
address passed in to masked_user_access_begin() were to be some complex
expression, we should evaluate it fully before doing the 'stac'
instruction.
And even without that issue (which objdump would pick up on for any
really bad case), just in general we should strive to minimize the
amount of code we run with user accesses enabled.
For example, even for the trivial pselect6() case, the code generation
diff (obviously with a non-debug build) with this change ends up being
- stac
mov %rax,%rcx
sar $0x3f,%rcx
or %rax,%rcx
+ stac
mov (%rcx),%r13
mov 0x8(%rcx),%r14
clac
so the area delimited by the 'stac / clac' pair is now literally just
the two user access instructions, and the address generation has been
moved out to before that code.
This will be much more noticeable if we end up deciding that we can go
back to just inlining "get_user()" using the new masked user access
model. The get_user() pointers can often be more complex expressions
involving kernel memory accesses or even function calls.
Signed-off-by: Linus Torvalds <[email protected]>
|
|
The Spectre-v1 mitigations made "access_ok()" much more expensive, since
it has to serialize execution with the test for a valid user address.
All the normal user copy routines avoid this by just masking the user
address with a data-dependent mask instead, but the fast
"unsafe_user_read()" kind of patterms that were supposed to be a fast
case got slowed down.
This introduces a notion of using
src = masked_user_access_begin(src);
to do the user address sanity check using a data-dependent mask instead of the
more traditional conditional
if (user_read_access_begin(src, len)) {
model.
This model only works for dense accesses that start at 'src' and on
architectures that have a guard region that is guaranteed to fault in
between the user space and the kernel space area.
With this, the user access doesn't need to be manually checked, because
a bad address is guaranteed to fault (by some architecture masking
trick: on x86-64 this involves just turning an invalid user address into
all ones, since we don't map the top of address space).
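A minimal usage sketch of the pattern (the two-word structure and its field
names are hypothetical; can_do_masked_user_access() and the unsafe_get_user()
helpers are the primitives this model builds on):
    if (can_do_masked_user_access())
            from = masked_user_access_begin(from);
    else if (!user_read_access_begin(from, sizeof(*from)))
            return -EFAULT;
    unsafe_get_user(dst->first, &from->first, Efault);
    unsafe_get_user(dst->second, &from->second, Efault);
    user_read_access_end();
    return 0;
    Efault:
    user_read_access_end();
    return -EFAULT;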
This only converts a couple of examples for now. Example x86-64 code
generation for loading two words from user space:
stac
mov %rax,%rcx
sar $0x3f,%rcx
or %rax,%rcx
mov (%rcx),%r13
mov 0x8(%rcx),%r14
clac
where all the error handling and -EFAULT is now purely handled out of
line by the exception path.
Of course, if the micro-architecture does badly at 'clac' and 'stac',
the above is still pitifully slow. But at least we did as well as we
could.
Signed-off-by: Linus Torvalds <[email protected]>
|
|
The Rust compiler added support for `-Zfunction-return=thunk-extern` [1]
in 1.76.0 [2], i.e. the equivalent of `-mfunction-return=thunk-extern`.
Thus add support for `MITIGATION_RETHUNK`.
Without this, `objtool` would warn if enabled for Rust and already warns
under IBT builds, e.g.:
samples/rust/rust_print.o: warning: objtool:
_R...init+0xa5c: 'naked' return found in RETHUNK build
Link: https://github.com/rust-lang/rust/issues/116853 [1]
Link: https://github.com/rust-lang/rust/pull/116892 [2]
Reviewed-by: Gary Guo <[email protected]>
Tested-by: Alice Ryhl <[email protected]>
Tested-by: Benno Lossin <[email protected]>
Reviewed-by: Thomas Gleixner <[email protected]>
Link: https://github.com/Rust-for-Linux/linux/issues/945
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Miguel Ojeda <[email protected]>
|
|
Support `MITIGATION_RETPOLINE` by enabling the target features that
Clang does.
The existing target feature being enabled was a leftover from
our old `rust` branch, and it is not enough: the target feature
`retpoline-external-thunk` only implies `retpoline-indirect-calls`, but
not `retpoline-indirect-branches` (see LLVM's `X86.td`), unlike Clang's
flag of the same name `-mretpoline-external-thunk` which does imply both
(see Clang's `lib/Driver/ToolChains/Arch/X86.cpp`).
Without this, `objtool` would complain if enabled for Rust, e.g.:
rust/core.o: warning: objtool:
_R...escape_default+0x13: indirect jump found in RETPOLINE build
In addition, change the comment to note that LLVM is the one disabling
jump tables when retpoline is enabled, thus we do not need to use
`-Zno-jump-tables` for Rust here -- see commit c58f2166ab39 ("Introduce
the "retpoline" x86 mitigation technique ...") [1]:
The goal is simple: avoid generating code which contains an indirect
branch that could have its prediction poisoned by an attacker. In
many cases, the compiler can simply use directed conditional
branches and a small search tree. LLVM already has support for
lowering switches in this way and the first step of this patch is
to disable jump-table lowering of switches and introduce a pass to
rewrite explicit indirectbr sequences into a switch over integers.
As well as a live example at [2].
These should be eventually enabled via `-Ctarget-feature` when `rustc`
starts recognizing them (or via a new dedicated flag) [3].
Cc: Daniel Borkmann <[email protected]>
Link: https://github.com/llvm/llvm-project/commit/c58f2166ab3987f37cb0d7815b561bff5a20a69a [1]
Link: https://godbolt.org/z/G4YPr58qG [2]
Link: https://github.com/rust-lang/rust/issues/116852 [3]
Reviewed-by: Gary Guo <[email protected]>
Tested-by: Alice Ryhl <[email protected]>
Tested-by: Benno Lossin <[email protected]>
Link: https://github.com/Rust-for-Linux/linux/issues/945
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Miguel Ojeda <[email protected]>
|
|
There is already a check for 'asid > MAX_ASID_AVAILABLE' in kern_pcid(), so
it is unnecessary to perform this check in build_cr3() right before calling
kern_pcid().
Remove it.
Signed-off-by: Yuntao Wang <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Link: https://lore.kernel.org/all/[email protected]
|
|
There are two distinct CPU features related to the use of XSAVES and LBR:
whether LBR is itself supported and whether XSAVES supports LBR. The LBR
subsystem correctly checks both in intel_pmu_arch_lbr_init(), but the
XSTATE subsystem does not.
The LBR bit is only removed from xfeatures_mask_independent when LBR is not
supported by the CPU, but there is no validation of XSTATE support.
If XSAVES does not support LBR the write to IA32_XSS causes a #GP fault,
leaving the state of IA32_XSS unchanged, i.e. zero. The fault is handled
with a warning and the boot continues.
Consequently the next XRSTORS which tries to restore supervisor state fails
with #GP because the RFBM has zero for all supervisor features, which does
not match the XCOMP_BV field.
As XFEATURE_MASK_FPSTATE includes supervisor features, setting up the FPU
causes a #GP, which ends up in fpu_reset_from_exception_fixup(). That fails
due to the same problem resulting in recursive #GPs until the kernel runs
out of stack space and double faults.
Prevent this by storing the supported independent features in
fpu_kernel_cfg during XSTATE initialization and use that cached value for
retrieving the independent feature bits to be written into IA32_XSS.
[ tglx: Massaged change log ]
Fixes: f0dccc9da4c0 ("x86/fpu/xstate: Support dynamic supervisor feature for LBR")
Suggested-by: Thomas Gleixner <[email protected]>
Signed-off-by: Mitchell Levy <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Cc: [email protected]
Link: https://lore.kernel.org/all/[email protected]
|
|
Disallow read-only memslots for SEV-{ES,SNP} VM types, as KVM can't
directly emulate instructions for ES/SNP, and instead the guest must
explicitly request emulation. Unless the guest explicitly requests
emulation without accessing memory, ES/SNP relies on KVM creating an MMIO
SPTE, with the subsequent #NPF being reflected into the guest as a #VC.
But for read-only memslots, KVM deliberately doesn't create MMIO SPTEs,
because except for ES/SNP, doing so requires setting reserved bits in the
SPTE, i.e. the SPTE can't be readable while also generating a #VC on
writes. Because KVM never creates MMIO SPTEs and jumps directly to
emulation, the guest never gets a #VC. And since KVM simply resumes the
guest if ES/SNP guests trigger emulation, KVM effectively puts the vCPU
into an infinite #NPF loop if the vCPU attempts to write read-only memory.
Disallow read-only memory for all VMs with protected state, i.e. for
upcoming TDX VMs as well as ES/SNP VMs. For TDX, it's actually possible
to support read-only memory, as TDX uses EPT Violation #VE to reflect the
fault into the guest, e.g. KVM could configure read-only SPTEs with RX
protections and SUPPRESS_VE=0. But there is no strong use case for
supporting read-only memslots on TDX, e.g. the main historical usage is
to emulate option ROMs, but TDX disallows executing from shared memory.
And if someone comes along with a legitimate, strong use case, the
restriction can always be lifted for TDX.
Don't bother trying to retroactively apply the restriction to SEV-ES
VMs that are created as type KVM_X86_DEFAULT_VM. Read-only memslots can't
possibly work for SEV-ES, i.e. disallowing such memslots really just means
reporting an error to userspace instead of silently hanging vCPUs.
Trying to deal with the ordering between KVM_SEV_INIT and memslot creation
isn't worth the marginal benefit it would provide userspace.
Fixes: 26c44aa9e076 ("KVM: SEV: define VM types for SEV and SEV-ES")
Fixes: 1dfe571c12cf ("KVM: SEV: Add initial SEV-SNP support")
Cc: Peter Gonda <[email protected]>
Cc: Michael Roth <[email protected]>
Cc: Vishal Annapurve <[email protected]>
Cc: Ackerly Tng <[email protected]>
Signed-off-by: Sean Christopherson <[email protected]>
Message-ID: <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
|
|
Since commit 4763ed4d4552 ("x86, mm: Clean up and simplify NX enablement")
these declarations are unused and can be removed.
Signed-off-by: Yue Haibing <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Link: https://lore.kernel.org/all/[email protected]
|
|
Commit 43fe1abc18a2 ("x86/uv: Use hierarchical irqdomain to manage UV interrupts")
removed the implementation but left the declaration behind.
Signed-off-by: Yue Haibing <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Link: https://lore.kernel.org/all/[email protected]
|
|
On 64-bit init_mem_mapping() relies on the minimal page fault handler
provided by the early IDT mechanism. The real page fault handler is
installed right afterwards into the IDT.
This is problematic on CPUs which have X86_FEATURE_FRED set because the
real page fault handler retrieves the faulting address from the FRED
exception stack frame and not from CR2, but that does obviously not work
when FRED is not yet enabled in the CPU.
To prevent this enable FRED right after init_mem_mapping() without
interrupt stacks. Those are enabled later in trap_init() after the CPU
entry area is set up.
[ tglx: Encapsulate the FRED details ]
Fixes: 14619d912b65 ("x86/fred: FRED entry/exit and dispatch code")
Reported-by: Hou Wenlong <[email protected]>
Suggested-by: Thomas Gleixner <[email protected]>
Signed-off-by: Xin Li (Intel) <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Link: https://lore.kernel.org/all/[email protected]
|
|
To enable FRED earlier, move the RSP initialization out of
cpu_init_fred_exceptions() into cpu_init_fred_rsps().
This is required as the FRED RSP initialization depends on the availability
of the CPU entry areas which are set up late in trap_init().
No functional change intended. Marked with Fixes as it's a dependency for
the real fix.
Fixes: 14619d912b65 ("x86/fred: FRED entry/exit and dispatch code")
Signed-off-by: Xin Li (Intel) <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Link: https://lore.kernel.org/all/[email protected]
|
|
Depending on whether FRED is enabled, sysvec_install() installs a system
interrupt handler either into FRED's system vector dispatch table or
into the IDT.
However FRED can be disabled later in trap_init(), after sysvec_install()
has been invoked already; e.g., the HYPERVISOR_CALLBACK_VECTOR handler is
registered with sysvec_install() in kvm_guest_init(), which is called in
setup_arch() but way before trap_init().
IOW, there is a gap between FRED being detected as available and FRED being
disabled.
As a result, when FRED is available but disabled, early sysvec_install()
invocations fail to install the IDT handler resulting in spurious
interrupts.
Fix it by parsing the cmdline param "fred=" in cpu_parse_early_param() to
ensure that FRED is disabled before the first sysvec_install() invocation.
Fixes: 3810da12710a ("x86/fred: Add a fred= cmdline param")
Reported-by: Hou Wenlong <[email protected]>
Suggested-by: Thomas Gleixner <[email protected]>
Signed-off-by: Xin Li (Intel) <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Link: https://lore.kernel.org/all/[email protected]
|
|
Ignore the userspace provided x2APIC ID when fixing up APIC state for
KVM_SET_LAPIC, i.e. make the x2APIC ID fully readonly in KVM. Commit
a92e2543d6a8 ("KVM: x86: use hardware-compatible format for APIC ID
register"), which added the fixup, didn't intend to allow userspace to
modify the x2APIC ID. In fact, that commit is when KVM first started
treating the x2APIC ID as readonly, apparently to fix some race:
static inline u32 kvm_apic_id(struct kvm_lapic *apic)
{
- return (kvm_lapic_get_reg(apic, APIC_ID) >> 24) & 0xff;
+ /* To avoid a race between apic_base and following APIC_ID update when
+ * switching to x2apic_mode, the x2apic mode returns initial x2apic id.
+ */
+ if (apic_x2apic_mode(apic))
+ return apic->vcpu->vcpu_id;
+
+ return kvm_lapic_get_reg(apic, APIC_ID) >> 24;
}
Furthermore, KVM doesn't support delivering interrupts to vCPUs with a
modified x2APIC ID, but KVM *does* return the modified value on a guest
RDMSR and for KVM_GET_LAPIC. I.e. no remotely sane setup can actually
work with a modified x2APIC ID.
Making the x2APIC ID fully readonly fixes a WARN in KVM's optimized map
calculation, which expects the LDR to align with the x2APIC ID.
WARNING: CPU: 2 PID: 958 at arch/x86/kvm/lapic.c:331 kvm_recalculate_apic_map+0x609/0xa00 [kvm]
CPU: 2 PID: 958 Comm: recalc_apic_map Not tainted 6.4.0-rc3-vanilla+ #35
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Arch Linux 1.16.2-1-1 04/01/2014
RIP: 0010:kvm_recalculate_apic_map+0x609/0xa00 [kvm]
Call Trace:
<TASK>
kvm_apic_set_state+0x1cf/0x5b0 [kvm]
kvm_arch_vcpu_ioctl+0x1806/0x2100 [kvm]
kvm_vcpu_ioctl+0x663/0x8a0 [kvm]
__x64_sys_ioctl+0xb8/0xf0
do_syscall_64+0x56/0x80
entry_SYSCALL_64_after_hwframe+0x46/0xb0
RIP: 0033:0x7fade8b9dd6f
Unfortunately, the WARN can still trigger for other CPUs than the current
one by racing against KVM_SET_LAPIC, so remove it completely.
Reported-by: Michal Luczaj <[email protected]>
Closes: https://lore.kernel.org/all/[email protected]
Reported-by: Haoyu Wu <[email protected]>
Closes: https://lore.kernel.org/all/[email protected]
Reported-by: [email protected]
Closes: https://lore.kernel.org/all/[email protected]
Signed-off-by: Sean Christopherson <[email protected]>
Message-ID: <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
|
|
Use this_cpu_ptr() instead of open coding the equivalent in various
user return MSR helpers.
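As a rough before/after illustration (the per-CPU variable declaration here is
hypothetical, just to show the shape of the change):
    DEFINE_PER_CPU(struct kvm_user_return_msrs, user_return_msrs);

    /* open coded */
    struct kvm_user_return_msrs *msrs_old =
            &per_cpu(user_return_msrs, smp_processor_id());

    /* equivalent, with the helper */
    struct kvm_user_return_msrs *msrs_new = this_cpu_ptr(&user_return_msrs);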
Signed-off-by: Isaku Yamahata <[email protected]>
Reviewed-by: Chao Gao <[email protected]>
Reviewed-by: Yuan Yao <[email protected]>
[sean: massage changelog]
Signed-off-by: Sean Christopherson <[email protected]>
Reviewed-by: Pankaj Gupta <[email protected]>
Message-ID: <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
|
|
There is no caller in tree since its introduction in commit b4f69df0f65e
("KVM: x86: Make Hyper-V emulation optional").
Signed-off-by: Yue Haibing <[email protected]>
Message-ID: <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
|
|
x2apic_disable() clears x2apic_state and x2apic_mode unconditionally, even
when the state is X2APIC_ON_LOCKED, which prevents the kernel from disabling
it, thereby creating inconsistent state.
Due to the early state check for X2APIC_ON, the code path which warns about
a locked X2APIC cannot be reached.
Test for state < X2APIC_ON instead and move the clearing of the state and
mode variables to the place which actually disables X2APIC.
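The resulting shape of x2apic_disable() is roughly the following sketch (not
the literal diff; helper names are assumptions):
    static void x2apic_disable(void)
    {
            u32 x2apic_id;

            if (x2apic_state < X2APIC_ON)
                    return;

            x2apic_id = read_apic_id();
            if (x2apic_id >= 255)
                    panic("Cannot disable x2apic, id: %08x\n", x2apic_id);

            if (x2apic_hw_locked()) {
                    pr_warn("Cannot disable locked x2apic, id: %08x\n", x2apic_id);
                    return;
            }

            __x2apic_disable();
            x2apic_mode = 0;
            x2apic_state = X2APIC_DISABLED;
    }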
[ tglx: Massaged change log. Added Fixes tag. Moved clearing so it's at the
right place for back ports ]
Fixes: a57e456a7b28 ("x86/apic: Fix fallout from x2apic cleanup")
Signed-off-by: Yuntao Wang <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Cc: [email protected]
Link: https://lore.kernel.org/all/[email protected]
|
|
The copy_from_user() function returns the number of bytes which it
was not able to copy. Return -EFAULT instead.
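The usual idiom looks like this (argument names are illustrative only):
    if (copy_from_user(&params, argp, sizeof(params)))
            return -EFAULT;	/* not the remaining byte count */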
Fixes: dee5a47cc7a4 ("KVM: SEV: Add KVM_SEV_SNP_LAUNCH_UPDATE command")
Signed-off-by: Dan Carpenter <[email protected]>
Message-ID: <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
|
|
If snp_lookup_rmpentry() fails then "assigned" is printed in the error
message but it was never initialized. Initialize it to false.
Fixes: dee5a47cc7a4 ("KVM: SEV: Add KVM_SEV_SNP_LAUNCH_UPDATE command")
Signed-off-by: Dan Carpenter <[email protected]>
Message-ID: <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
|
|
commit f5a3562ec9dd ("x86/irq: Reserve a per CPU IDT vector for posted
MSIs") changed the first system vector from LOCAL_TIMER_VECTOR to
POSTED_MSI_NOTIFICATION_VECTOR. Reflect this change in the vector layout
comment as well.
However, instead of pointing to the specific vector, use the
FIRST_SYSTEM_VECTOR indirection which essentially refers to the same thing.
This avoids unnecessary modifications to the same comment whenever
additional system vectors get added.
Signed-off-by: Sohil Mehta <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Reviewed-by: Jacob Pan <[email protected]>
Link: https://lore.kernel.org/all/[email protected]
|
|
The removal of get_physical_broadcast(), generic_bigsmp_probe() and
default_apic_id_valid() left the stale declarations around.
Remove them.
Signed-off-by: Yue Haibing <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Link: https://lore.kernel.org/all/[email protected]
|
|
For any changes of struct fd representation we need to
turn existing accesses to fields into calls of wrappers.
Accesses to struct fd::flags are very few (3 in linux/file.h,
1 in net/socket.c, 3 in fs/overlayfs/file.c and 3 more in
explicit initializers).
Those can be dealt with in the commit converting to
new layout; accesses to struct fd::file are too many for that.
This commit converts (almost) all of f.file to
fd_file(f). It's not entirely mechanical ('file' is used as
a member name more than just in struct fd) and it does not
even attempt to distinguish the uses in pointer context from
those in boolean context; the latter will be eventually turned
into a separate helper (fd_empty()).
NOTE: mass conversion to fd_empty(), tempting as it
might be, is a bad idea; better do that piecewise in commit
that convert from fdget...() to CLASS(...).
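The shape of the conversion on a generic (hypothetical) call site:
    struct fd f = fdget(fd);

    if (!fd_file(f))			/* was: if (!f.file) */
            return -EBADF;
    ret = vfs_fsync(fd_file(f), 0);	/* was: vfs_fsync(f.file, 0) */
    fdput(f);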
[conflicts in fs/fhandle.c, kernel/bpf/syscall.c, mm/memcontrol.c
caught by git; fs/stat.c one got caught by git grep]
[fs/xattr.c conflict]
Reviewed-by: Christian Brauner <[email protected]>
Signed-off-by: Al Viro <[email protected]>
|
|
Structural Based Functional Test at Field (SBAF) is a new type of
testing that provides comprehensive core test coverage complementing
existing IFS tests like Scan at Field (SAF) or ArrayBist.
The SBAF device will appear as a new device instance (intel_ifs_2) under
/sys/devices/virtual/misc. The user interaction necessary to load the
test image and test a particular core is the same as the existing scan
test (intel_ifs_0).
During the loading stage, the driver will look for a file named
ff-mm-ss-<batch02x>.sbft in the /lib/firmware/intel/ifs_2 directory.
The hardware interaction needed for loading the image is similar to
SAF, with the only difference being the MSR addresses used. Reuse the
SAF image loading code, passing the SBAF-specific MSR addresses via
struct ifs_test_msrs in the driver device data.
Unlike SAF, the SBAF test image chunks are further divided into smaller
logical entities called bundles. Since the SBAF test is initiated per
bundle, cache the maximum number of bundles in the current image, which
is used for iterating through bundles during SBAF test execution.
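For context, the per-test MSR description passed via the device data might
look roughly like this; the field names are illustrative assumptions, not the
exact upstream layout:
    struct ifs_test_msrs {
            u32 copy_hashes;		/* MSR to copy the hashes */
            u32 copy_hashes_status;	/* MSR to check hash copy status */
            u32 copy_chunks;		/* MSR to copy test image chunks */
            u32 chunks_status;		/* MSR to check chunk copy status */
            u32 test_ctrl;		/* MSR to start a test */
    };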
Reviewed-by: Ashok Raj <[email protected]>
Reviewed-by: Tony Luck <[email protected]>
Reviewed-by: Ilpo Järvinen <[email protected]>
Signed-off-by: Jithu Joseph <[email protected]>
Co-developed-by: Kuppuswamy Sathyanarayanan <[email protected]>
Signed-off-by: Kuppuswamy Sathyanarayanan <[email protected]>
Link: https://lore.kernel.org/r/20240801051814.1935149-3-sathyanarayanan.kuppuswamy@linux.intel.com
Signed-off-by: Hans de Goede <[email protected]>
|
|
Commit 6fd166aae78c ("x86/mm: Use/Fix PCID to optimize user/kernel
switches") removed the last usage of CR3_HW_ASID_BITS and opted to use
X86_CR3_PCID_BITS instead. Remove CR3_HW_ASID_BITS.
Signed-off-by: Yosry Ahmed <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Link: https://lore.kernel.org/all/[email protected]
|
|
On PREEMPT_RT, kfree() takes sleeping locks and must not be called with
preemption disabled. Therefore, on PREEMPT_RT skcipher_walk_done() must
not be called from within a kernel_fpu_{begin,end}() pair, even when
it's the last call which is guaranteed to not allocate memory.
Therefore, move the last skcipher_walk_done() in gcm_crypt() to the end
of the function so that it goes after the kernel_fpu_end(). To make
this work cleanly, rework the data processing loop to handle only
non-last data segments.
Fixes: b06affb1cb58 ("crypto: x86/aes-gcm - add VAES and AVX512 / AVX10 optimized AES-GCM")
Reported-by: Sebastian Andrzej Siewior <[email protected]>
Closes: https://lore.kernel.org/linux-crypto/[email protected]
Signed-off-by: Eric Biggers <[email protected]>
Tested-by: Sebastian Andrzej Siewior <[email protected]>
Signed-off-by: Herbert Xu <[email protected]>
|
|
Logical destination mode of the local APIC is used for systems with up to
8 CPUs. It has an advantage over physical destination mode as it allows
targeting multiple CPUs at once with IPIs.
That advantage was definitely worth it when systems with up to 8 CPUs
were state of the art for servers and workstations, but that's history.
Aside from that, there are systems which fail to work with logical destination
mode as the ACPI/DMI quirks show and there are AMD Zen1 systems out there
which fail when interrupt remapping is enabled as reported by Rob and
Christian. The latter problem can be cured by firmware updates, but not all
OEMs distribute the required changes.
Physical destination mode is guaranteed to work because it is the only way
to get a CPU up and running via the INIT/INIT/STARTUP sequence.
As the number of CPUs keeps increasing, logical destination mode becomes a
less used code path so there is no real good reason to keep it around.
Therefore remove logical destination mode support for 64-bit and default to
physical destination mode.
Reported-by: Rob Newcater <[email protected]>
Reported-by: Christian Heusel <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Tested-by: Borislav Petkov (AMD) <[email protected]>
Tested-by: Rob Newcater <[email protected]>
Link: https://lore.kernel.org/all/877cd5u671.ffs@tglx
|
|
Stack unwinding produces large amounts of uninteresting coverage.
It's called from KASAN kmalloc/kfree hooks, fault injection, etc.
It's not particularly useful and is not a function of system call args.
Ignore that code.
Signed-off-by: Dmitry Vyukov <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Reviewed-by: Alexander Potapenko <[email protected]>
Reviewed-by: Marco Elver <[email protected]>
Reviewed-by: Andrey Konovalov <[email protected]>
Acked-by: Peter Zijlstra (Intel) <[email protected]>
Link: https://lore.kernel.org/all/eaf54b8634970b73552dcd38bf9be6ef55238c10.1718092070.git.dvyukov@google.com
|
|
common_interrupt() and related variants call kvm_set_cpu_l1tf_flush_l1d(),
which is neither marked noinstr nor __always_inline.
So the compiler puts it out of line and adds instrumentation to it. Since the
call is inside of instrumentation_begin/end(), objtool does not warn about
it.
The manifestation is that KCOV produces spurious coverage in
kvm_set_cpu_l1tf_flush_l1d() in random places because the call happens when
preempt count is not yet updated to say that the kernel is in an interrupt.
Mark kvm_set_cpu_l1tf_flush_l1d() as __always_inline and move it out of the
instrumentation_begin/end() section. It only calls __this_cpu_write()
which is already safe to call in noinstr contexts.
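The helper is essentially a single per-CPU store, so inlining it is cheap;
roughly (a sketch based on the description above):
    static __always_inline void kvm_set_cpu_l1tf_flush_l1d(void)
    {
            __this_cpu_write(irq_stat.kvm_cpu_l1tf_flush_l1d, 1);
    }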
Fixes: 6368558c3710 ("x86/entry: Provide IDTENTRY_SYSVEC")
Signed-off-by: Dmitry Vyukov <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Reviewed-by: Alexander Potapenko <[email protected]>
Acked-by: Peter Zijlstra (Intel) <[email protected]>
Cc: [email protected]
Link: https://lore.kernel.org/all/3f9a1de9e415fcb53d07dc9e19fa8481bb021b1b.1718092070.git.dvyukov@google.com
|
|
This per CPU log is becoming longer with more and more CPUs in the system,
which slows down the boot process due to the serializing nature of
printk().
The value of this information is dubious and it can be retrieved by lscpu
from user space if required.
Downgrade the printk() to pr_debug() so it is still accessible for debug
purposes.
[ tglx: Massaged changelog ]
Signed-off-by: Li RongQing <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Link: https://lore.kernel.org/all/[email protected]
|
|
MTRRs have an obsolete fixed variant for fine grained caching control
of the 640K-1MB region that uses separate MSRs. This fixed variant has
a separate capability bit in the MTRR capability MSR.
So far all x86 CPUs which support MTRR have this separate bit set, so it
went unnoticed that mtrr_save_state() does not check the capability bit
before accessing the fixed MTRR MSRs.
Though on a CPU that does not support the fixed MTRR capability this
results in a #GP. The #GP itself is harmless because the RDMSR fault is
handled gracefully, but results in a WARN_ON().
Add the missing capability check to prevent this.
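The missing check boils down to bailing out before touching the fixed-range
MSRs; a minimal sketch, assuming the cached mtrr_state.have_fixed flag
reflects the capability bit:
    /* Fixed-range MTRRs are optional; skip their MSRs if not advertised */
    if (!mtrr_state.have_fixed)
            return;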
Fixes: 2b1f6278d77c ("[PATCH] x86: Save the MTRRs of the BSP before booting an AP")
Signed-off-by: Andi Kleen <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Cc: [email protected]
Link: https://lore.kernel.org/all/[email protected]
|
|
The kernel can change spinlock behavior when running as a guest. But this
guest-friendly behavior causes performance problems on bare metal.
The kernel uses a static key to switch between the two modes.
In theory, the static key is enabled by default (run in guest mode) and
should be disabled for bare metal (and in some guests that want native
behavior or paravirt spinlock).
A performance drop is reported when running encode/decode workload and
BenchSEE cache sub-workload.
Bisect points to commit ce0a1b608bfc ("x86/paravirt: Silence unused
native_pv_lock_init() function warning"). When CONFIG_PARAVIRT_SPINLOCKS is
disabled the virt_spin_lock_key is incorrectly set to true on bare
metal. The qspinlock degenerates to test-and-set spinlock, which decreases
the performance on bare metal.
Set the default value of virt_spin_lock_key to false. If booting in a VM,
enable this key. Later during VM initialization, if another more efficient
spinlock implementation is preferred (e.g. paravirt spinlocks), or the user
wants the native qspinlock (via the nopvspin boot command line), the
virt_spin_lock_key is disabled accordingly.
This results in the following decision matrix:
  X86_FEATURE_HYPERVISOR      Y    Y    Y    N
  CONFIG_PARAVIRT_SPINLOCKS   Y    Y    N    Y/N
  PV spinlock                 Y    N    N    Y/N
  virt_spin_lock_key          N    Y/N  Y    N
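A sketch of the corresponding default/enable logic (the exact placement is an
assumption):
    DEFINE_STATIC_KEY_FALSE(virt_spin_lock_key);

    void __init native_pv_lock_init(void)
    {
            if (cpu_feature_enabled(X86_FEATURE_HYPERVISOR))
                    static_branch_enable(&virt_spin_lock_key);
    }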
Fixes: ce0a1b608bfc ("x86/paravirt: Silence unused native_pv_lock_init() function warning")
Reported-by: Prem Nath Dey <[email protected]>
Reported-by: Xiaoping Zhou <[email protected]>
Suggested-by: Dave Hansen <[email protected]>
Suggested-by: Qiuxu Zhuo <[email protected]>
Suggested-by: Nikolay Borisov <[email protected]>
Signed-off-by: Chen Yu <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Reviewed-by: Nikolay Borisov <[email protected]>
Cc: [email protected]
Link: https://lore.kernel.org/all/[email protected]
|
|
On a platform using the "Multiprocessor Wakeup Structure"[1] to startup
secondary CPUs the control processor needs to memremap() the physical
address of the MP Wakeup Structure mailbox to the variable
acpi_mp_wake_mailbox, which holds the virtual address of the mailbox.
To wake up the AP the control processor writes the APIC ID of AP, the
wakeup vector and the ACPI_MP_WAKE_COMMAND_WAKEUP command into the mailbox.
The current implementation doesn't consider the case which restricts boot time
CPU bringup to 1 with the kernel parameter "maxcpus=1" and brings other
CPUs online later from user space, as it sets acpi_mp_wake_mailbox to
read-only after init. So when the first AP is brought online after init,
the attempt to update the variable results in a kernel panic.
The memremap() call that initializes the variable cannot be moved into
acpi_parse_mp_wake() because memremap() is not functional at that point in
the boot process. Also as the APs might never be brought up, keep the
memremap() call in acpi_wakeup_cpu() so that the operation only takes place
when needed.
Fixes: 24dd05da8c79 ("x86/apic: Mark acpi_mp_wake_* variables as __ro_after_init")
Signed-off-by: Zhiquan Li <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Reviewed-by: Kirill A. Shutemov <[email protected]>
Link: https://lore.kernel.org/all/[email protected]
|
|
Commit 2744a7ce34a7 ("x86/apic: Replace acpi_wake_cpu_handler_update() and
apic_set_eoi_cb()") left it unused.
Signed-off-by: Yue Haibing <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Link: https://lore.kernel.org/all/[email protected]
|
|
Add missing new lines and reorder variable definitions.
Signed-off-by: Thomas Gleixner <[email protected]>
Tested-by: Qiuxu Zhuo <[email protected]>
Tested-by: Breno Leitao <[email protected]>
Link: https://lore.kernel.org/all/[email protected]
|
|
80 character limit is history.
Signed-off-by: Thomas Gleixner <[email protected]>
Tested-by: Qiuxu Zhuo <[email protected]>
Tested-by: Breno Leitao <[email protected]>
Link: https://lore.kernel.org/all/[email protected]
|
|
Add brackets around if/for constructs as required by coding style or remove
pointless line breaks to make it true single line statements which do not
require brackets.
Signed-off-by: Thomas Gleixner <[email protected]>
Tested-by: Qiuxu Zhuo <[email protected]>
Tested-by: Breno Leitao <[email protected]>
Link: https://lore.kernel.org/all/[email protected]
|
|
Use proper comment styles and shrink comments to their scope where
applicable.
Signed-off-by: Thomas Gleixner <[email protected]>
Tested-by: Qiuxu Zhuo <[email protected]>
Tested-by: Breno Leitao <[email protected]>
Link: https://lore.kernel.org/all/[email protected]
|
|
It's only used by check_timer().
Signed-off-by: Thomas Gleixner <[email protected]>
Tested-by: Qiuxu Zhuo <[email protected]>
Tested-by: Breno Leitao <[email protected]>
Link: https://lore.kernel.org/all/[email protected]
|
|
Use the new apic_pr_verbose() helper.
Signed-off-by: Thomas Gleixner <[email protected]>
Tested-by: Qiuxu Zhuo <[email protected]>
Tested-by: Breno Leitao <[email protected]>
Link: https://lore.kernel.org/all/[email protected]
|
|
Cleanup the APIC printk()s which are inside of an APIC verbosity guarded
region by using apic_dbg() for the KERN_DEBUG level prints.
Signed-off-by: Thomas Gleixner <[email protected]>
Tested-by: Qiuxu Zhuo <[email protected]>
Tested-by: Breno Leitao <[email protected]>
Link: https://lore.kernel.org/all/[email protected]
|
|
Replace apic_printk($LEVEL) with the corresponding apic_pr_*() helpers and
use pr_info() for APIC_QUIET as that is always printed so the indirection
is pointless noise.
Signed-off-by: Thomas Gleixner <[email protected]>
Tested-by: Qiuxu Zhuo <[email protected]>
Tested-by: Breno Leitao <[email protected]>
Link: https://lore.kernel.org/all/[email protected]
|
|
Use the new apic_pr_*() helpers and cleanup the apic_printk() maze.
Signed-off-by: Thomas Gleixner <[email protected]>
Tested-by: Qiuxu Zhuo <[email protected]>
Tested-by: Breno Leitao <[email protected]>
Link: https://lore.kernel.org/all/[email protected]
|
|
apic_printk() requires the APIC verbosity level and printk level which is
tedious and horrible to read. Provide helpers to simplify all of that.
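The helpers boil down to thin wrappers around apic_printk(); roughly (a sketch
of the idea, not necessarily the exact definitions):
    #define apic_pr_verbose(fmt, ...)	apic_printk(APIC_VERBOSE, KERN_INFO fmt, ##__VA_ARGS__)
    #define apic_dbg(fmt, ...)		apic_printk(APIC_DEBUG, KERN_DEBUG fmt, ##__VA_ARGS__)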
Signed-off-by: Thomas Gleixner <[email protected]>
Tested-by: Qiuxu Zhuo <[email protected]>
Tested-by: Breno Leitao <[email protected]>
Link: https://lore.kernel.org/all/[email protected]
|
|
KISS rules!
Signed-off-by: Thomas Gleixner <[email protected]>
Tested-by: Qiuxu Zhuo <[email protected]>
Tested-by: Breno Leitao <[email protected]>
Link: https://lore.kernel.org/all/[email protected]
|
|
Make them conforming to the TIP coding style guide.
Signed-off-by: Thomas Gleixner <[email protected]>
Tested-by: Qiuxu Zhuo <[email protected]>
Tested-by: Breno Leitao <[email protected]>
Link: https://lore.kernel.org/all/[email protected]
|
|
Only invoked from check_timer() which is __init too. Cleanup the variable
declaration while at it.
Signed-off-by: Thomas Gleixner <[email protected]>
Tested-by: Qiuxu Zhuo <[email protected]>
Tested-by: Breno Leitao <[email protected]>
Reviewed-by: Breno Leitao <[email protected]>
Link: https://lore.kernel.org/all/[email protected]
|
|
Breno observed panics when using failslab under certain conditions during
runtime:
can not alloc irq_pin_list (-1,0,20)
Kernel panic - not syncing: IO-APIC: failed to add irq-pin. Can not proceed
panic+0x4e9/0x590
mp_irqdomain_alloc+0x9ab/0xa80
irq_domain_alloc_irqs_locked+0x25d/0x8d0
__irq_domain_alloc_irqs+0x80/0x110
mp_map_pin_to_irq+0x645/0x890
acpi_register_gsi_ioapic+0xe6/0x150
hpet_open+0x313/0x480
That's a pointless panic which is a leftover of the historic IO/APIC code
which panic'ed during early boot when the interrupt allocation failed.
The only place which might justify panic is the PIT/HPET timer_check() code
which tries to figure out whether the timer interrupt is delivered through
the IO/APIC. But that code does not require to handle interrupt allocation
failures. If the interrupt cannot be allocated then timer delivery fails
and it either panics due to that or falls back to legacy mode.
Cure this by removing the panic wrapper around __add_pin_to_irq_node() and
making mp_irqdomain_alloc() aware of the failure condition and handling it
gracefully like any other failure in this function.
Reported-by: Breno Leitao <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Tested-by: Breno Leitao <[email protected]>
Tested-by: Qiuxu Zhuo <[email protected]>
Link: https://lore.kernel.org/all/[email protected]
Link: https://lore.kernel.org/all/[email protected]
|
|
So it turns out that we have to do two passes of
pti_clone_entry_text(), once before initcalls, such that device and
late initcalls can use user-mode-helper / modprobe and once after
free_initmem() / mark_readonly().
Now obviously mark_readonly() can cause PMD splits, and
pti_clone_pgtable() doesn't like that much.
Allow the late clone to split PMDs so that pagetables stay in sync.
[peterz: Changelog and comments]
Reported-by: Guenter Roeck <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Tested-by: Guenter Roeck <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]
|