path: root/arch/x86

2024-08-22  Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf  (Alexei Starovoitov; 31 files, -158/+217)

Cross-merge bpf fixes after downstream PR including important fixes (from the bpf-next point of view):

  commit 41c24102af7b ("selftests/bpf: Filter out _GNU_SOURCE when compiling test_cpp")
  commit fdad456cbcca ("bpf: Fix updating attached freplace prog in prog_array map")

No conflicts.

Adjacent changes in:

  include/linux/bpf_verifier.h
  kernel/bpf/verifier.c
  tools/testing/selftests/bpf/Makefile

Link: https://lore.kernel.org/bpf/[email protected]/
Signed-off-by: Alexei Starovoitov <[email protected]>
2024-08-21  x86/PCI: Check pcie_find_root_port() return for NULL  (Samasth Norway Ananda; 1 file, -2/+2)

If pcie_find_root_port() is unable to locate a Root Port, it will return NULL. Check the pointer for NULL before dereferencing it.

This particular case is in a quirk for devices that are always below a Root Port, so the check won't avoid a real problem and doesn't need to be backported, but add it as a matter of style and to prevent copy/paste mistakes.

Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Samasth Norway Ananda <[email protected]>
[bhelgaas: drop Fixes: and explain why there's no problem in this case]
Signed-off-by: Bjorn Helgaas <[email protected]>
Reviewed-by: Mario Limonciello <[email protected]>
2024-08-20  x86/kaslr: Expose and use the end of the physical memory address space  (Thomas Gleixner; 4 files, -6/+35)

iounmap() on x86 occasionally fails to unmap because the provided valid ioremap address is not below high_memory. It turned out that this happens due to KASLR.

KASLR uses the full address space between PAGE_OFFSET and vaddr_end to randomize the starting points of the direct map, vmalloc and vmemmap regions. It thereby limits the size of the direct map by using the installed memory size plus an extra configurable margin for hot-plug memory. This limitation is done to gain more randomization space because otherwise only the holes between the direct map, vmalloc, vmemmap and vaddr_end would be usable for randomizing.

The limited direct map size is not exposed to the rest of the kernel, so the memory hot-plug and resource management related code paths still operate under the assumption that the available address space can be determined with MAX_PHYSMEM_BITS.

request_free_mem_region() allocates from (1 << MAX_PHYSMEM_BITS) - 1 downwards. That means the first allocation happens past the end of the direct map and if unlucky this address is in the vmalloc space, which causes high_memory to become greater than VMALLOC_START and consequently causes iounmap() to fail for valid ioremap addresses.

MAX_PHYSMEM_BITS cannot be changed for that because the randomization does not align with address bit boundaries and there are other places which actually require knowing the maximum number of address bits. All remaining usage sites of MAX_PHYSMEM_BITS have been analyzed and found to be correct.

Cure this by exposing the end of the direct map via PHYSMEM_END and use that for the memory hot-plug and resource management related places instead of relying on MAX_PHYSMEM_BITS. In the KASLR case PHYSMEM_END maps to a variable which is initialized by the KASLR initialization and otherwise it is based on MAX_PHYSMEM_BITS as before.

To prevent future hiccups add a check into add_pages() to catch callers trying to add memory above PHYSMEM_END.

Fixes: 0483e1fa6e09 ("x86/mm: Implement ASLR for kernel memory regions")
Reported-by: Max Ramanouski <[email protected]>
Reported-by: Alistair Popple <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Tested-by: Max Ramanouski <[email protected]>
Tested-by: Alistair Popple <[email protected]>
Reviewed-by: Dan Williams <[email protected]>
Reviewed-by: Alistair Popple <[email protected]>
Reviewed-by: Kees Cook <[email protected]>
Cc: [email protected]
Link: https://lore.kernel.org/all/87ed6soy3z.ffs@tglx
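As a rough sketch, the PHYSMEM_END indirection described above could take the following shape; the variable name physmem_end, the header placement and the config guard are assumptions for illustration, not verified against the final patch:

    /* Sketch: end of the usable physical address space. With KASLR the
     * direct map may end well below the architectural maximum, so
     * PHYSMEM_END resolves to a variable set during KASLR init;
     * otherwise it falls back to MAX_PHYSMEM_BITS as before. */
    #ifdef CONFIG_RANDOMIZE_MEMORY
    extern unsigned long physmem_end;           /* set by KASLR init */
    # define PHYSMEM_END    physmem_end
    #else
    # define PHYSMEM_END    ((1ULL << MAX_PHYSMEM_BITS) - 1)
    #endif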
2024-08-19  x86: do the user address masking outside the user access area  (Linus Torvalds; 1 file, -1/+3)

In any normal situation this really shouldn't matter, but in case the address passed in to masked_user_access_begin() were to be some complex expression, we should evaluate it fully before doing the 'stac' instruction.

And even without that issue (which objdump would pick up on for any really bad case), just in general we should strive to minimize the amount of code we run with user accesses enabled.

For example, even for the trivial pselect6() case, the diff in code generation (obviously with a non-debug build) with this change ends up being

    -	stac
     	mov    %rax,%rcx
     	sar    $0x3f,%rcx
     	or     %rax,%rcx
    +	stac
     	mov    (%rcx),%r13
     	mov    0x8(%rcx),%r14
     	clac

so the area delimited by the 'stac / clac' pair is now literally just the two user access instructions, and the address generation has been moved out to before that code.

This will be much more noticeable if we end up deciding that we can go back to just inlining "get_user()" using the new masked user access model. The get_user() pointers can often be more complex expressions involving kernel memory accesses or even function calls.

Signed-off-by: Linus Torvalds <[email protected]>
2024-08-19  x86: support user address masking instead of non-speculative conditional  (Linus Torvalds; 1 file, -0/+8)

The Spectre-v1 mitigations made "access_ok()" much more expensive, since it has to serialize execution with the test for a valid user address.

All the normal user copy routines avoid this by just masking the user address with a data-dependent mask instead, but the fast "unsafe_user_read()" kind of patterns that were supposed to be a fast case got slowed down.

This introduces a notion of using

	src = masked_user_access_begin(src);

to do the user address sanity check using a data-dependent mask instead of the more traditional conditional

	if (user_read_access_begin(src, len)) {

model.

This model only works for dense accesses that start at 'src' and on architectures that have a guard region that is guaranteed to fault in between the user space and the kernel space area.

With this, the user access doesn't need to be manually checked, because a bad address is guaranteed to fault (by some architecture masking trick: on x86-64 this involves just turning an invalid user address into all ones, since we don't map the top of address space).

This only converts a couple of examples for now. Example x86-64 code generation for loading two words from user space:

	stac
	mov    %rax,%rcx
	sar    $0x3f,%rcx
	or     %rax,%rcx
	mov    (%rcx),%r13
	mov    0x8(%rcx),%r14
	clac

where all the error handling and -EFAULT is now purely handled out of line by the exception path.

Of course, if the micro-architecture does badly at 'clac' and 'stac', the above is still pitifully slow. But at least we did as well as we could.

Signed-off-by: Linus Torvalds <[email protected]>
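For orientation, a minimal caller-side sketch of the pattern; the function read_pair() is invented for illustration, and can_do_masked_user_access() is assumed to be the feature gate introduced alongside this model:

    /* Sketch: read two userspace words with the masked-access model. */
    static int read_pair(const u64 __user *usrc, u64 *a, u64 *b)
    {
            if (can_do_masked_user_access())
                    /* A bad pointer becomes a guaranteed-faulting one. */
                    usrc = masked_user_access_begin(usrc);
            else if (!user_read_access_begin(usrc, 2 * sizeof(u64)))
                    return -EFAULT;

            unsafe_get_user(*a, &usrc[0], efault); /* faults jump to efault */
            unsafe_get_user(*b, &usrc[1], efault);
            user_read_access_end();
            return 0;

    efault:
            user_read_access_end();
            return -EFAULT;
    }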
2024-08-18  x86/rust: support MITIGATION_RETHUNK  (Miguel Ojeda; 1 file, -0/+5)

The Rust compiler added support for `-Zfunction-return=thunk-extern` [1] in 1.76.0 [2], i.e. the equivalent of `-mfunction-return=thunk-extern`.

Thus add support for `MITIGATION_RETHUNK`.

Without this, `objtool` would warn if enabled for Rust and already warns under IBT builds, e.g.:

    samples/rust/rust_print.o: warning: objtool: _R...init+0xa5c: 'naked' return found in RETHUNK build

Link: https://github.com/rust-lang/rust/issues/116853 [1]
Link: https://github.com/rust-lang/rust/pull/116892 [2]
Reviewed-by: Gary Guo <[email protected]>
Tested-by: Alice Ryhl <[email protected]>
Tested-by: Benno Lossin <[email protected]>
Reviewed-by: Thomas Gleixner <[email protected]>
Link: https://github.com/Rust-for-Linux/linux/issues/945
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Miguel Ojeda <[email protected]>
2024-08-18  x86/rust: support MITIGATION_RETPOLINE  (Miguel Ojeda; 1 file, -1/+1)

Support `MITIGATION_RETPOLINE` by enabling the target features that Clang does.

The existing target feature being enabled was a leftover from our old `rust` branch, and it is not enough: the target feature `retpoline-external-thunk` only implies `retpoline-indirect-calls`, but not `retpoline-indirect-branches` (see LLVM's `X86.td`), unlike Clang's flag of the same name `-mretpoline-external-thunk` which does imply both (see Clang's `lib/Driver/ToolChains/Arch/X86.cpp`).

Without this, `objtool` would complain if enabled for Rust, e.g.:

    rust/core.o: warning: objtool: _R...escape_default+0x13: indirect jump found in RETPOLINE build

In addition, change the comment to note that LLVM is the one disabling jump tables when retpoline is enabled, thus we do not need to use `-Zno-jump-tables` for Rust here -- see commit c58f2166ab39 ("Introduce the "retpoline" x86 mitigation technique ...") [1]:

    The goal is simple: avoid generating code which contains an indirect branch that could have its prediction poisoned by an attacker. In many cases, the compiler can simply use directed conditional branches and a small search tree. LLVM already has support for lowering switches in this way and the first step of this patch is to disable jump-table lowering of switches and introduce a pass to rewrite explicit indirectbr sequences into a switch over integers.

As well as a live example at [2].

These should eventually be enabled via `-Ctarget-feature` when `rustc` starts recognizing them (or via a new dedicated flag) [3].

Cc: Daniel Borkmann <[email protected]>
Link: https://github.com/llvm/llvm-project/commit/c58f2166ab3987f37cb0d7815b561bff5a20a69a [1]
Link: https://godbolt.org/z/G4YPr58qG [2]
Link: https://github.com/rust-lang/rust/issues/116852 [3]
Reviewed-by: Gary Guo <[email protected]>
Tested-by: Alice Ryhl <[email protected]>
Tested-by: Benno Lossin <[email protected]>
Link: https://github.com/Rust-for-Linux/linux/issues/945
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Miguel Ojeda <[email protected]>
2024-08-14  x86/mm: Remove duplicate check from build_cr3()  (Yuntao Wang; 1 file, -1/+0)

There is already a check for 'asid > MAX_ASID_AVAILABLE' in kern_pcid(), so it is unnecessary to perform this check in build_cr3() right before calling kern_pcid(). Remove it.

Signed-off-by: Yuntao Wang <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Link: https://lore.kernel.org/all/[email protected]
2024-08-14  x86/fpu: Avoid writing LBR bit to IA32_XSS unless supported  (Mitchell Levy; 3 files, -2/+12)

There are two distinct CPU features related to the use of XSAVES and LBR: whether LBR is itself supported and whether XSAVES supports LBR. The LBR subsystem correctly checks both in intel_pmu_arch_lbr_init(), but the XSTATE subsystem does not.

The LBR bit is only removed from xfeatures_mask_independent when LBR is not supported by the CPU, but there is no validation of XSTATE support.

If XSAVES does not support LBR, the write to IA32_XSS causes a #GP fault, leaving the state of IA32_XSS unchanged, i.e. zero. The fault is handled with a warning and the boot continues.

Consequently the next XRSTORS which tries to restore supervisor state fails with #GP because the RFBM has zero for all supervisor features, which does not match the XCOMP_BV field.

As XFEATURE_MASK_FPSTATE includes supervisor features, setting up the FPU causes a #GP, which ends up in fpu_reset_from_exception_fixup(). That fails due to the same problem, resulting in recursive #GPs until the kernel runs out of stack space and double faults.

Prevent this by storing the supported independent features in fpu_kernel_cfg during XSTATE initialization and use that cached value for retrieving the independent feature bits to be written into IA32_XSS.

[ tglx: Massaged change log ]

Fixes: f0dccc9da4c0 ("x86/fpu/xstate: Support dynamic supervisor feature for LBR")
Suggested-by: Thomas Gleixner <[email protected]>
Signed-off-by: Mitchell Levy <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Cc: [email protected]
Link: https://lore.kernel.org/all/[email protected]
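A sketch of the caching described in the last paragraph; the field name independent_features and the helper shape follow my reading of the description and should be treated as assumptions:

    /* At XSTATE init: record which independent features XSAVES actually
     * supports, instead of deriving them later from CPU flags alone. */
    fpu_kernel_cfg.independent_features = fpu_kernel_cfg.max_features &
                                          XFEATURE_MASK_INDEPENDENT;

    /* When programming IA32_XSS, only use the cached, validated bits,
     * so an LBR-capable CPU without XSAVES-LBR support can't #GP. */
    static inline u64 xfeatures_mask_independent(void)
    {
            if (!cpu_feature_enabled(X86_FEATURE_ARCH_LBR))
                    return fpu_kernel_cfg.independent_features &
                           ~XFEATURE_MASK_LBR;

            return fpu_kernel_cfg.independent_features;
    }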
2024-08-14  KVM: x86: Disallow read-only memslots for SEV-ES and SEV-SNP (and TDX)  (Sean Christopherson; 1 file, -0/+2)

Disallow read-only memslots for SEV-{ES,SNP} VM types, as KVM can't directly emulate instructions for ES/SNP, and instead the guest must explicitly request emulation.

Unless the guest explicitly requests emulation without accessing memory, ES/SNP relies on KVM creating an MMIO SPTE, with the subsequent #NPF being reflected into the guest as a #VC. But for read-only memslots, KVM deliberately doesn't create MMIO SPTEs, because except for ES/SNP, doing so requires setting reserved bits in the SPTE, i.e. the SPTE can't be readable while also generating a #VC on writes. Because KVM never creates MMIO SPTEs and jumps directly to emulation, the guest never gets a #VC. And since KVM simply resumes the guest if ES/SNP guests trigger emulation, KVM effectively puts the vCPU into an infinite #NPF loop if the vCPU attempts to write read-only memory.

Disallow read-only memory for all VMs with protected state, i.e. for upcoming TDX VMs as well as ES/SNP VMs. For TDX, it's actually possible to support read-only memory, as TDX uses EPT Violation #VE to reflect the fault into the guest, e.g. KVM could configure read-only SPTEs with RX protections and SUPPRESS_VE=0. But there is no strong use case for supporting read-only memslots on TDX, e.g. the main historical usage is to emulate option ROMs, but TDX disallows executing from shared memory. And if someone comes along with a legitimate, strong use case, the restriction can always be lifted for TDX.

Don't bother trying to retroactively apply the restriction to SEV-ES VMs that are created as type KVM_X86_DEFAULT_VM. Read-only memslots can't possibly work for SEV-ES, i.e. disallowing such memslots really just means reporting an error to userspace instead of silently hanging vCPUs. Trying to deal with the ordering between KVM_SEV_INIT and memslot creation isn't worth the marginal benefit it would provide userspace.

Fixes: 26c44aa9e076 ("KVM: SEV: define VM types for SEV and SEV-ES")
Fixes: 1dfe571c12cf ("KVM: SEV: Add initial SEV-SNP support")
Cc: Peter Gonda <[email protected]>
Cc: Michael Roth <[email protected]>
Cc: Vishal Annapurve <[email protected]>
Cc: Ackerly Tng <[email protected]>
Signed-off-by: Sean Christopherson <[email protected]>
Message-ID: <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
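Given the 2-line diffstat, a plausible shape for the restriction is a per-VM capability check along these lines; the macro and field names are assumptions inferred from the description, not copied from the patch:

    /* Sketch: VMs with protected guest state (SEV-ES/SNP, TDX) can't
     * use read-only memslots, since KVM cannot emulate the faulting
     * access and never creates MMIO SPTEs for read-only memory. */
    #define kvm_arch_has_readonly_mem(kvm) (!(kvm)->arch.has_protected_state)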
2024-08-14  x86/mm: Remove unused NX related declarations  (Yue Haibing; 1 file, -2/+0)

Since commit 4763ed4d4552 ("x86, mm: Clean up and simplify NX enablement") these declarations are unused and can be removed.

Signed-off-by: Yue Haibing <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Link: https://lore.kernel.org/all/[email protected]
2024-08-14  x86/platform/uv: Remove unused declaration uv_irq_2_mmr_info()  (Yue Haibing; 1 file, -1/+0)

Commit 43fe1abc18a2 ("x86/uv: Use hierarchical irqdomain to manage UV interrupts") removed the implementation but left the declaration behind.

Signed-off-by: Yue Haibing <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Link: https://lore.kernel.org/all/[email protected]
2024-08-13  x86/fred: Enable FRED right after init_mem_mapping()  (Xin Li (Intel); 5 files, -6/+23)

On 64-bit, init_mem_mapping() relies on the minimal page fault handler provided by the early IDT mechanism. The real page fault handler is installed right afterwards into the IDT.

This is problematic on CPUs which have X86_FEATURE_FRED set because the real page fault handler retrieves the faulting address from the FRED exception stack frame and not from CR2, but that obviously does not work when FRED is not yet enabled in the CPU.

To prevent this, enable FRED right after init_mem_mapping() without interrupt stacks. Those are enabled later in trap_init() after the CPU entry area is set up.

[ tglx: Encapsulate the FRED details ]

Fixes: 14619d912b65 ("x86/fred: FRED entry/exit and dispatch code")
Reported-by: Hou Wenlong <[email protected]>
Suggested-by: Thomas Gleixner <[email protected]>
Signed-off-by: Xin Li (Intel) <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Link: https://lore.kernel.org/all/[email protected]
2024-08-13  x86/fred: Move FRED RSP initialization into a separate function  (Xin Li (Intel); 3 files, -11/+25)

To enable FRED earlier, move the RSP initialization out of cpu_init_fred_exceptions() into cpu_init_fred_rsps().

This is required as the FRED RSP initialization depends on the availability of the CPU entry areas, which are set up late in trap_init().

No functional change intended. Marked with Fixes as it's a dependency for the real fix.

Fixes: 14619d912b65 ("x86/fred: FRED entry/exit and dispatch code")
Signed-off-by: Xin Li (Intel) <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Link: https://lore.kernel.org/all/[email protected]
2024-08-13  x86/fred: Parse cmdline param "fred=" in cpu_parse_early_param()  (Xin Li (Intel); 2 files, -26/+5)

Depending on whether FRED is enabled, sysvec_install() installs a system interrupt handler into either FRED's system vector dispatch table or the IDT.

However, FRED can be disabled later in trap_init(), after sysvec_install() has been invoked already; e.g., the HYPERVISOR_CALLBACK_VECTOR handler is registered with sysvec_install() in kvm_guest_init(), which is called in setup_arch() but way before trap_init(). IOW, there is a gap between the point where FRED is deemed available and the point where it is disabled.

As a result, when FRED is available but disabled, early sysvec_install() invocations fail to install the IDT handler, resulting in spurious interrupts.

Fix it by parsing the cmdline param "fred=" in cpu_parse_early_param() to ensure that FRED is disabled before the first sysvec_install() invocation.

Fixes: 3810da12710a ("x86/fred: Add a fred= cmdline param")
Reported-by: Hou Wenlong <[email protected]>
Suggested-by: Thomas Gleixner <[email protected]>
Signed-off-by: Xin Li (Intel) <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Link: https://lore.kernel.org/all/[email protected]
2024-08-13  KVM: x86: Make x2APIC ID 100% readonly  (Sean Christopherson; 1 file, -7/+15)

Ignore the userspace provided x2APIC ID when fixing up APIC state for KVM_SET_LAPIC, i.e. make the x2APIC ID fully readonly in KVM.

Commit a92e2543d6a8 ("KVM: x86: use hardware-compatible format for APIC ID register"), which added the fixup, didn't intend to allow userspace to modify the x2APIC ID. In fact, that commit is when KVM first started treating the x2APIC ID as readonly, apparently to fix some race:

     static inline u32 kvm_apic_id(struct kvm_lapic *apic)
     {
    -	return (kvm_lapic_get_reg(apic, APIC_ID) >> 24) & 0xff;
    +	/* To avoid a race between apic_base and following APIC_ID update
    +	 * when switching to x2apic_mode, the x2apic mode returns initial
    +	 * x2apic id.
    +	 */
    +	if (apic_x2apic_mode(apic))
    +		return apic->vcpu->vcpu_id;
    +
    +	return kvm_lapic_get_reg(apic, APIC_ID) >> 24;
     }

Furthermore, KVM doesn't support delivering interrupts to vCPUs with a modified x2APIC ID, but KVM *does* return the modified value on a guest RDMSR and for KVM_GET_LAPIC. I.e. no remotely sane setup can actually work with a modified x2APIC ID.

Making the x2APIC ID fully readonly fixes a WARN in KVM's optimized map calculation, which expects the LDR to align with the x2APIC ID:

  WARNING: CPU: 2 PID: 958 at arch/x86/kvm/lapic.c:331 kvm_recalculate_apic_map+0x609/0xa00 [kvm]
  CPU: 2 PID: 958 Comm: recalc_apic_map Not tainted 6.4.0-rc3-vanilla+ #35
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Arch Linux 1.16.2-1-1 04/01/2014
  RIP: 0010:kvm_recalculate_apic_map+0x609/0xa00 [kvm]
  Call Trace:
   <TASK>
   kvm_apic_set_state+0x1cf/0x5b0 [kvm]
   kvm_arch_vcpu_ioctl+0x1806/0x2100 [kvm]
   kvm_vcpu_ioctl+0x663/0x8a0 [kvm]
   __x64_sys_ioctl+0xb8/0xf0
   do_syscall_64+0x56/0x80
   entry_SYSCALL_64_after_hwframe+0x46/0xb0
  RIP: 0033:0x7fade8b9dd6f

Unfortunately, the WARN can still trigger for other CPUs than the current one by racing against KVM_SET_LAPIC, so remove it completely.

Reported-by: Michal Luczaj <[email protected]>
Closes: https://lore.kernel.org/all/[email protected]
Reported-by: Haoyu Wu <[email protected]>
Closes: https://lore.kernel.org/all/[email protected]
Reported-by: [email protected]
Closes: https://lore.kernel.org/all/[email protected]
Signed-off-by: Sean Christopherson <[email protected]>
Message-ID: <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
2024-08-13  KVM: x86: Use this_cpu_ptr() instead of per_cpu_ptr(smp_processor_id())  (Isaku Yamahata; 1 file, -4/+2)

Use this_cpu_ptr() instead of open coding the equivalent in various user return MSR helpers.

Signed-off-by: Isaku Yamahata <[email protected]>
Reviewed-by: Chao Gao <[email protected]>
Reviewed-by: Yuan Yao <[email protected]>
[sean: massage changelog]
Signed-off-by: Sean Christopherson <[email protected]>
Reviewed-by: Pankaj Gupta <[email protected]>
Message-ID: <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
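For readers unfamiliar with the per-CPU accessors, the transformation looks like this; user_return_msrs is the per-CPU allocation these helpers operate on per my reading, so treat the name as an assumption:

    /* Before: open-coded equivalent of this_cpu_ptr() */
    struct kvm_user_return_msrs *msrs =
            per_cpu_ptr(user_return_msrs, smp_processor_id());

    /* After: same pointer, clearer intent, no separate CPU-id lookup */
    struct kvm_user_return_msrs *msrs = this_cpu_ptr(user_return_msrs);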
2024-08-13  KVM: x86: hyper-v: Remove unused inline function kvm_hv_free_pa_page()  (Yue Haibing; 1 file, -1/+0)

There has been no in-tree caller since its introduction in commit b4f69df0f65e ("KVM: x86: Make Hyper-V emulation optional").

Signed-off-by: Yue Haibing <[email protected]>
Message-ID: <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
2024-08-13  x86/apic: Make x2apic_disable() work correctly  (Yuntao Wang; 1 file, -5/+6)

x2apic_disable() clears x2apic_state and x2apic_mode unconditionally, even when the state is X2APIC_ON_LOCKED, which prevents the kernel from disabling x2APIC and thereby creates inconsistent state.

Due to the early state check for X2APIC_ON, the code path which warns about a locked X2APIC cannot be reached.

Test for state < X2APIC_ON instead and move the clearing of the state and mode variables to the place which actually disables X2APIC.

[ tglx: Massaged change log. Added Fixes tag. Moved clearing so it's at the right place for back ports ]

Fixes: a57e456a7b28 ("x86/apic: Fix fallout from x2apic cleanup")
Signed-off-by: Yuntao Wang <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Cc: [email protected]
Link: https://lore.kernel.org/all/[email protected]
2024-08-13  KVM: SVM: Fix an error code in sev_gmem_post_populate()  (Dan Carpenter; 1 file, -2/+3)

The copy_from_user() function returns the number of bytes which it was not able to copy. Return -EFAULT instead.

Fixes: dee5a47cc7a4 ("KVM: SEV: Add KVM_SEV_SNP_LAUNCH_UPDATE command")
Signed-off-by: Dan Carpenter <[email protected]>
Message-ID: <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
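The underlying idiom, independent of the SEV specifics (the surrounding names here are illustrative, not the actual sev_gmem_post_populate() code):

    /* copy_from_user() returns the number of bytes it could NOT copy,
     * not an errno; callers must translate a non-zero return. */
    if (copy_from_user(dst, src, len))
            return -EFAULT;         /* don't propagate the byte count */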
2024-08-13  KVM: SVM: Fix uninitialized variable bug  (Dan Carpenter; 1 file, -1/+1)

If snp_lookup_rmpentry() fails then "assigned" is printed in the error message but it was never initialized. Initialize it to false.

Fixes: dee5a47cc7a4 ("KVM: SEV: Add KVM_SEV_SNP_LAUNCH_UPDATE command")
Signed-off-by: Dan Carpenter <[email protected]>
Message-ID: <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
2024-08-13  x86/irq: Fix comment on IRQ vector layout  (Sohil Mehta; 1 file, -2/+2)

Commit f5a3562ec9dd ("x86/irq: Reserve a per CPU IDT vector for posted MSIs") changed the first system vector from LOCAL_TIMER_VECTOR to POSTED_MSI_NOTIFICATION_VECTOR. Reflect this change in the vector layout comment as well.

However, instead of pointing to the specific vector, use the FIRST_SYSTEM_VECTOR indirection, which refers to the same vector. This avoids unnecessary modifications to the same comment whenever additional system vectors get added.

Signed-off-by: Sohil Mehta <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Reviewed-by: Jacob Pan <[email protected]>
Link: https://lore.kernel.org/all/[email protected]
2024-08-13  x86/apic: Remove unused extern declarations  (Yue Haibing; 1 file, -6/+0)

The removal of get_physical_broadcast(), generic_bigsmp_probe() and default_apic_id_valid() left the stale declarations around. Remove them.

Signed-off-by: Yue Haibing <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Link: https://lore.kernel.org/all/[email protected]
2024-08-12  introduce fd_file(), convert all accessors to it.  (Al Viro; 2 files, -10/+10)

For any changes of struct fd representation we need to turn existing accesses to fields into calls of wrappers.

Accesses to struct fd::flags are very few (3 in linux/file.h, 1 in net/socket.c, 3 in fs/overlayfs/file.c and 3 more in explicit initializers). Those can be dealt with in the commit converting to the new layout; accesses to struct fd::file are too many for that.

This commit converts (almost) all of f.file to fd_file(f). It's not entirely mechanical ('file' is used as a member name more than just in struct fd) and it does not even attempt to distinguish the uses in pointer context from those in boolean context; the latter will be eventually turned into a separate helper (fd_empty()).

NOTE: mass conversion to fd_empty(), tempting as it might be, is a bad idea; better do that piecewise in commits that convert from fdget...() to CLASS(...).

[conflicts in fs/fhandle.c, kernel/bpf/syscall.c, mm/memcontrol.c caught by git; fs/stat.c one got caught by git grep]
[fs/xattr.c conflict]

Reviewed-by: Christian Brauner <[email protected]>
Signed-off-by: Al Viro <[email protected]>
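An illustrative before/after of the mechanical part of the conversion (the surrounding syscall body is invented for the sketch):

    struct fd f = fdget(fd);

    if (!fd_file(f))                        /* was: if (!f.file) */
            return -EBADF;
    ret = vfs_fsync(fd_file(f), 0);         /* was: vfs_fsync(f.file, 0) */
    fdput(f);

The point of routing every access through fd_file() is that a later commit can change the struct fd layout (e.g. pack file pointer and flags into one word) without touching the hundreds of call sites again.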
2024-08-12  platform/x86/intel/ifs: Add SBAF test image loading support  (Jithu Joseph; 1 file, -0/+2)

Structural Based Functional Test at Field (SBAF) is a new type of testing that provides comprehensive core test coverage, complementing existing IFS tests like Scan at Field (SAF) or ArrayBist.

The SBAF device will appear as a new device instance (intel_ifs_2) under /sys/devices/virtual/misc. The user interaction necessary to load the test image and test a particular core is the same as for the existing scan test (intel_ifs_0).

During the loading stage, the driver will look for a file named ff-mm-ss-<batch02x>.sbft in the /lib/firmware/intel/ifs_2 directory. The hardware interaction needed for loading the image is similar to SAF, with the only difference being the MSR addresses used. Reuse the SAF image loading code, passing the SBAF-specific MSR addresses via struct ifs_test_msrs in the driver device data.

Unlike SAF, the SBAF test image chunks are further divided into smaller logical entities called bundles. Since the SBAF test is initiated per bundle, cache the maximum number of bundles in the current image, which is used for iterating through bundles during SBAF test execution.

Reviewed-by: Ashok Raj <[email protected]>
Reviewed-by: Tony Luck <[email protected]>
Reviewed-by: Ilpo Järvinen <[email protected]>
Signed-off-by: Jithu Joseph <[email protected]>
Co-developed-by: Kuppuswamy Sathyanarayanan <[email protected]>
Signed-off-by: Kuppuswamy Sathyanarayanan <[email protected]>
Link: https://lore.kernel.org/r/20240801051814.1935149-3-sathyanarayanan.kuppuswamy@linux.intel.com
Signed-off-by: Hans de Goede <[email protected]>
2024-08-10  x86/mm: Remove unused CR3_HW_ASID_BITS  (Yosry Ahmed; 1 file, -3/+0)

Commit 6fd166aae78c ("x86/mm: Use/Fix PCID to optimize user/kernel switches") removed the last usage of CR3_HW_ASID_BITS and opted to use X86_CR3_PCID_BITS instead. Remove CR3_HW_ASID_BITS.

Signed-off-by: Yosry Ahmed <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Link: https://lore.kernel.org/all/[email protected]
2024-08-10  crypto: x86/aes-gcm - fix PREEMPT_RT issue in gcm_crypt()  (Eric Biggers; 1 file, -31/+28)

On PREEMPT_RT, kfree() takes sleeping locks and must not be called with preemption disabled. Therefore, on PREEMPT_RT, skcipher_walk_done() must not be called from within a kernel_fpu_{begin,end}() pair, even when it's the last call, which is guaranteed to not allocate memory.

Therefore, move the last skcipher_walk_done() in gcm_crypt() to the end of the function so that it goes after the kernel_fpu_end(). To make this work cleanly, rework the data processing loop to handle only non-last data segments.

Fixes: b06affb1cb58 ("crypto: x86/aes-gcm - add VAES and AVX512 / AVX10 optimized AES-GCM")
Reported-by: Sebastian Andrzej Siewior <[email protected]>
Closes: https://lore.kernel.org/linux-crypto/[email protected]
Signed-off-by: Eric Biggers <[email protected]>
Tested-by: Sebastian Andrzej Siewior <[email protected]>
Signed-off-by: Herbert Xu <[email protected]>
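A heavily simplified schematic of the restructured control flow; this is a sketch of the idea only, not the actual gcm_crypt() loop, and the per-segment processing is elided:

    kernel_fpu_begin();
    /* Loop handles only non-last segments. */
    while (walk.nbytes != walk.total) {
            /* ... encrypt/decrypt this full segment ... */
            err = skcipher_walk_done(&walk, 0);     /* intermediate step */
    }
    /* ... process the final (possibly partial) segment ... */
    kernel_fpu_end();
    /* The last skcipher_walk_done() may kfree(), and on PREEMPT_RT
     * kfree() takes sleeping locks, so it must follow kernel_fpu_end(). */
    if (walk.nbytes)
            err = skcipher_walk_done(&walk, 0);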
2024-08-09  x86/apic: Remove logical destination mode for 64-bit  (Thomas Gleixner; 2 files, -120/+7)

Logical destination mode of the local APIC is used for systems with up to 8 CPUs. It has an advantage over physical destination mode as it allows targeting multiple CPUs at once with IPIs.

That advantage was definitely worth it when systems with up to 8 CPUs were state of the art for servers and workstations, but that's history.

Aside from that, there are systems which fail to work with logical destination mode, as the ACPI/DMI quirks show, and there are AMD Zen1 systems out there which fail when interrupt remapping is enabled, as reported by Rob and Christian. The latter problem can be cured by firmware updates, but not all OEMs distribute the required changes.

Physical destination mode is guaranteed to work because it is the only way to get a CPU up and running via the INIT/INIT/STARTUP sequence.

As the number of CPUs keeps increasing, logical destination mode becomes a less used code path, so there is no real good reason to keep it around. Therefore remove logical destination mode support for 64-bit and default to physical destination mode.

Reported-by: Rob Newcater <[email protected]>
Reported-by: Christian Heusel <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Tested-by: Borislav Petkov (AMD) <[email protected]>
Tested-by: Rob Newcater <[email protected]>
Link: https://lore.kernel.org/all/877cd5u671.ffs@tglx
2024-08-08  x86: Ignore stack unwinding in KCOV  (Dmitry Vyukov; 1 file, -0/+8)

Stack unwinding produces large amounts of uninteresting coverage. It's called from KASAN kmalloc/kfree hooks, fault injection, etc. It's not particularly useful and is not a function of system call args. Ignore that code.

Signed-off-by: Dmitry Vyukov <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Reviewed-by: Alexander Potapenko <[email protected]>
Reviewed-by: Marco Elver <[email protected]>
Reviewed-by: Andrey Konovalov <[email protected]>
Acked-by: Peter Zijlstra (Intel) <[email protected]>
Link: https://lore.kernel.org/all/eaf54b8634970b73552dcd38bf9be6ef55238c10.1718092070.git.dvyukov@google.com
2024-08-08  x86/entry: Remove unwanted instrumentation in common_interrupt()  (Dmitry Vyukov; 2 files, -5/+9)

common_interrupt() and related variants call kvm_set_cpu_l1tf_flush_l1d(), which is neither marked noinstr nor __always_inline. So the compiler puts it out of line and adds instrumentation to it.

Since the call is inside of instrumentation_begin/end(), objtool does not warn about it. The manifestation is that KCOV produces spurious coverage in kvm_set_cpu_l1tf_flush_l1d() in random places, because the call happens when the preempt count is not yet updated to say that the kernel is in an interrupt.

Mark kvm_set_cpu_l1tf_flush_l1d() as __always_inline and move it out of the instrumentation_begin/end() section. It only calls __this_cpu_write(), which is already safe to call in noinstr contexts.

Fixes: 6368558c3710 ("x86/entry: Provide IDTENTRY_SYSVEC")
Signed-off-by: Dmitry Vyukov <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Reviewed-by: Alexander Potapenko <[email protected]>
Acked-by: Peter Zijlstra (Intel) <[email protected]>
Cc: [email protected]
Link: https://lore.kernel.org/all/3f9a1de9e415fcb53d07dc9e19fa8481bb021b1b.1718092070.git.dvyukov@google.com
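A sketch of the fix's shape; the per-CPU field written here is an assumption based on how this flag is tracked in irq_stat:

    /* Forced inlining means no out-of-line, instrumentable call is
     * emitted on the interrupt entry path; the body is a plain per-CPU
     * write, which is noinstr-safe. */
    static __always_inline void kvm_set_cpu_l1tf_flush_l1d(void)
    {
            __this_cpu_write(irq_stat.kvm_cpu_l1tf_flush_l1d, 1);
    }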
2024-08-08  x86/mm: Don't print out SRAT table information  (Li RongQing; 1 file, -4/+2)

This per-CPU log is becoming longer with more and more CPUs in the system, which slows down the boot process due to the serializing nature of printk().

The value of this information is dubious and it can be retrieved by lscpu from user space if required.

Downgrade the printk() to pr_debug() so it is still accessible for debug purposes.

[ tglx: Massaged changelog ]

Signed-off-by: Li RongQing <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Link: https://lore.kernel.org/all/[email protected]
2024-08-08  x86/mtrr: Check if fixed MTRRs exist before saving them  (Andi Kleen; 1 file, -1/+1)

MTRRs have an obsolete fixed variant for fine grained caching control of the 640K-1MB region that uses separate MSRs. This fixed variant has a separate capability bit in the MTRR capability MSR.

So far all x86 CPUs which support MTRR have this separate bit set, so it went unnoticed that mtrr_save_state() does not check the capability bit before accessing the fixed MTRR MSRs.

Though on a CPU that does not support the fixed MTRR capability this results in a #GP. The #GP itself is harmless because the RDMSR fault is handled gracefully, but results in a WARN_ON().

Add the missing capability check to prevent this.

Fixes: 2b1f6278d77c ("[PATCH] x86: Save the MTRRs of the BSP before booting an AP")
Signed-off-by: Andi Kleen <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Cc: [email protected]
Link: https://lore.kernel.org/all/[email protected]
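Given the one-line diffstat, the missing check plausibly lands in the fixed-range save path; the function and field names below follow my recollection of the MTRR code and should be treated as assumptions:

    void mtrr_save_fixed_ranges(void *info)
    {
            /* Fixed-range MTRRs have their own capability bit; reading
             * the fixed MSRs without it raises #GP. */
            if (boot_cpu_has(X86_FEATURE_MTRR) && mtrr_state.have_fixed)
                    get_fixed_ranges(mtrr_state.fixed_ranges);
    }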
2024-08-07  x86/paravirt: Fix incorrect virt spinlock setting on bare metal  (Chen Yu; 2 files, -9/+10)

The kernel can change spinlock behavior when running as a guest. But this guest-friendly behavior causes performance problems on bare metal.

The kernel uses a static key to switch between the two modes. In theory, the static key is enabled by default (run in guest mode) and should be disabled for bare metal (and in some guests that want native behavior or paravirt spinlocks).

A performance drop is reported when running an encode/decode workload and the BenchSEE cache sub-workload. Bisect points to commit ce0a1b608bfc ("x86/paravirt: Silence unused native_pv_lock_init() function warning"). When CONFIG_PARAVIRT_SPINLOCKS is disabled, virt_spin_lock_key is incorrectly set to true on bare metal. The qspinlock degenerates to a test-and-set spinlock, which decreases the performance on bare metal.

Set the default value of virt_spin_lock_key to false. If booting in a VM, enable this key. Later during VM initialization, if another high-efficiency spinlock is preferred (e.g. paravirt spinlocks), or the user wants the native qspinlock (via the nopvspin boot commandline), virt_spin_lock_key is disabled accordingly.

This results in the following decision matrix:

  X86_FEATURE_HYPERVISOR     Y    Y    Y     N
  CONFIG_PARAVIRT_SPINLOCKS  Y    Y    N     Y/N
  PV spinlock                Y    N    N     Y/N

  virt_spin_lock_key         N    Y/N  Y     N

Fixes: ce0a1b608bfc ("x86/paravirt: Silence unused native_pv_lock_init() function warning")
Reported-by: Prem Nath Dey <[email protected]>
Reported-by: Xiaoping Zhou <[email protected]>
Suggested-by: Dave Hansen <[email protected]>
Suggested-by: Qiuxu Zhuo <[email protected]>
Suggested-by: Nikolay Borisov <[email protected]>
Signed-off-by: Chen Yu <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Reviewed-by: Nikolay Borisov <[email protected]>
Cc: [email protected]
Link: https://lore.kernel.org/all/[email protected]
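A sketch of the default-off arrangement (simplified; per the changelog the real change also covers the paravirt/nopvspin paths that later clear the key again):

    /* Default off: bare metal never flips it on. */
    DEFINE_STATIC_KEY_FALSE(virt_spin_lock_key);

    void __init native_pv_lock_init(void)
    {
            /* Only guests opt in to the virt spinlock slow path. */
            if (boot_cpu_has(X86_FEATURE_HYPERVISOR))
                    static_branch_enable(&virt_spin_lock_key);
    }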
2024-08-07  x86/acpi: Remove __ro_after_init from acpi_mp_wake_mailbox  (Zhiquan Li; 1 file, -1/+1)

On a platform using the "Multiprocessor Wakeup Structure"[1] to start up secondary CPUs, the control processor needs to memremap() the physical address of the MP Wakeup Structure mailbox into the variable acpi_mp_wake_mailbox, which holds the virtual address of the mailbox. To wake up an AP, the control processor writes the APIC ID of the AP, the wakeup vector and the ACPI_MP_WAKE_COMMAND_WAKEUP command into the mailbox.

The current implementation doesn't consider the case which restricts boot time CPU bringup to 1 with the kernel parameter "maxcpus=1" and brings the other CPUs online later from user space, as it marks acpi_mp_wake_mailbox read-only after init. So when the first AP is brought online after init, the attempt to update the variable results in a kernel panic.

The memremap() call that initializes the variable cannot be moved into acpi_parse_mp_wake() because memremap() is not functional at that point in the boot process. Also, as the APs might never be brought up, keep the memremap() call in acpi_wakeup_cpu() so that the operation only takes place when needed.

Fixes: 24dd05da8c79 ("x86/apic: Mark acpi_mp_wake_* variables as __ro_after_init")
Signed-off-by: Zhiquan Li <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Reviewed-by: Kirill A. Shutemov <[email protected]>
Link: https://lore.kernel.org/all/[email protected]
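A sketch of the lazy mapping this implies; error handling and the wait-for-acknowledge loop are elided, and the exact variable names are taken from the description rather than the patch:

    static int acpi_wakeup_cpu(u32 apicid, unsigned long start_ip)
    {
            /* Map lazily: memremap() isn't usable when the MADT is
             * parsed, and the mapping is only needed if an AP is
             * actually started. */
            if (!acpi_mp_wake_mailbox)
                    acpi_mp_wake_mailbox =
                            memremap(acpi_mp_wake_mailbox_paddr,
                                     sizeof(*acpi_mp_wake_mailbox),
                                     MEMREMAP_WB);
            if (!acpi_mp_wake_mailbox)
                    return -ENOMEM;

            acpi_mp_wake_mailbox->apic_id = apicid;
            acpi_mp_wake_mailbox->wakeup_vector = start_ip;
            WRITE_ONCE(acpi_mp_wake_mailbox->command,
                       ACPI_MP_WAKE_COMMAND_WAKEUP);
            /* ... wait for the AP to acknowledge ... */
            return 0;
    }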
2024-08-07  x86/apic: Remove unused inline function apic_set_eoi_cb()  (Yue Haibing; 1 file, -1/+0)

Commit 2744a7ce34a7 ("x86/apic: Replace acpi_wake_cpu_handler_update() and apic_set_eoi_cb()") left it unused.

Signed-off-by: Yue Haibing <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Link: https://lore.kernel.org/all/[email protected]
2024-08-07  x86/ioapic: Cleanup remaining coding style issues  (Thomas Gleixner; 1 file, -11/+12)

Add missing new lines and reorder variable definitions.

Signed-off-by: Thomas Gleixner <[email protected]>
Tested-by: Qiuxu Zhuo <[email protected]>
Tested-by: Breno Leitao <[email protected]>
Link: https://lore.kernel.org/all/[email protected]
2024-08-07  x86/ioapic: Cleanup line breaks  (Thomas Gleixner; 1 file, -37/+18)

80 character limit is history.

Signed-off-by: Thomas Gleixner <[email protected]>
Tested-by: Qiuxu Zhuo <[email protected]>
Tested-by: Breno Leitao <[email protected]>
Link: https://lore.kernel.org/all/[email protected]
2024-08-07  x86/ioapic: Cleanup bracket usage  (Thomas Gleixner; 1 file, -16/+18)

Add brackets around if/for constructs as required by coding style, or remove pointless line breaks to make them true single line statements which do not require brackets.

Signed-off-by: Thomas Gleixner <[email protected]>
Tested-by: Qiuxu Zhuo <[email protected]>
Tested-by: Breno Leitao <[email protected]>
Link: https://lore.kernel.org/all/[email protected]
2024-08-07  x86/ioapic: Cleanup comments  (Thomas Gleixner; 1 file, -49/+37)

Use proper comment styles and shrink comments to their scope where applicable.

Signed-off-by: Thomas Gleixner <[email protected]>
Tested-by: Qiuxu Zhuo <[email protected]>
Tested-by: Breno Leitao <[email protected]>
Link: https://lore.kernel.org/all/[email protected]
2024-08-07  x86/ioapic: Move replace_pin_at_irq_node() to the call site  (Thomas Gleixner; 1 file, -22/+18)

It's only used by check_timer().

Signed-off-by: Thomas Gleixner <[email protected]>
Tested-by: Qiuxu Zhuo <[email protected]>
Tested-by: Breno Leitao <[email protected]>
Link: https://lore.kernel.org/all/[email protected]
2024-08-07  x86/mpparse: Cleanup apic_printk()s  (Thomas Gleixner; 1 file, -7/+6)

Use the new apic_pr_verbose() helper.

Signed-off-by: Thomas Gleixner <[email protected]>
Tested-by: Qiuxu Zhuo <[email protected]>
Tested-by: Breno Leitao <[email protected]>
Link: https://lore.kernel.org/all/[email protected]
2024-08-07  x86/ioapic: Cleanup guarded debug printk()s  (Thomas Gleixner; 1 file, -38/+29)

Cleanup the APIC printk()s which are inside of an APIC verbosity guarded region by using apic_dbg() for the KERN_DEBUG level prints.

Signed-off-by: Thomas Gleixner <[email protected]>
Tested-by: Qiuxu Zhuo <[email protected]>
Tested-by: Breno Leitao <[email protected]>
Link: https://lore.kernel.org/all/[email protected]
2024-08-07  x86/ioapic: Cleanup apic_printk()s  (Thomas Gleixner; 1 file, -103/+73)

Replace apic_printk($LEVEL) with the corresponding apic_pr_*() helpers and use pr_info() for APIC_QUIET, as that is always printed and the indirection is pointless noise.

Signed-off-by: Thomas Gleixner <[email protected]>
Tested-by: Qiuxu Zhuo <[email protected]>
Tested-by: Breno Leitao <[email protected]>
Link: https://lore.kernel.org/all/[email protected]
2024-08-07  x86/apic: Cleanup apic_printk()s  (Thomas Gleixner; 1 file, -46/+35)

Use the new apic_pr_*() helpers and cleanup the apic_printk() maze.

Signed-off-by: Thomas Gleixner <[email protected]>
Tested-by: Qiuxu Zhuo <[email protected]>
Tested-by: Breno Leitao <[email protected]>
Link: https://lore.kernel.org/all/[email protected]
2024-08-07  x86/apic: Provide apic_printk() helpers  (Thomas Gleixner; 1 file, -14/+19)

apic_printk() requires the APIC verbosity level and the printk level, which is tedious and horrible to read.

Provide helpers to simplify all of that.

Signed-off-by: Thomas Gleixner <[email protected]>
Tested-by: Qiuxu Zhuo <[email protected]>
Tested-by: Breno Leitao <[email protected]>
Link: https://lore.kernel.org/all/[email protected]
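Based on how the helpers are used in the follow-up cleanups (apic_pr_verbose(), apic_pr_*(), apic_dbg()), they plausibly wrap apic_printk() like this; the exact macro bodies are assumptions:

    /* Sketch: fold the (verbosity, printk level) pairs into one name each. */
    #define apic_pr_verbose(fmt, args...) \
            apic_printk(APIC_VERBOSE, KERN_INFO fmt, ##args)
    #define apic_dbg(fmt, args...) \
            apic_printk(APIC_DEBUG, KERN_DEBUG fmt, ##args)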
2024-08-07  x86/ioapic: Use guard() for locking where applicable  (Thomas Gleixner; 1 file, -128/+64)

KISS rules!

Signed-off-by: Thomas Gleixner <[email protected]>
Tested-by: Qiuxu Zhuo <[email protected]>
Tested-by: Breno Leitao <[email protected]>
Link: https://lore.kernel.org/all/[email protected]
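For context, the guard() conversion pattern from linux/cleanup.h looks roughly like this; the register access shown is illustrative, not lifted from the patch:

    /* Before: manual lock/unlock bracketing */
    unsigned long flags;
    unsigned int reg;

    raw_spin_lock_irqsave(&ioapic_lock, flags);
    reg = io_apic_read(apic, 0x10 + 2 * pin);
    raw_spin_unlock_irqrestore(&ioapic_lock, flags);

    /* After: the guard drops the lock automatically at scope exit,
     * which also removes unlock calls on early-return paths. */
    scoped_guard(raw_spinlock_irqsave, &ioapic_lock)
            reg = io_apic_read(apic, 0x10 + 2 * pin);

This is how a -128/+64 diffstat happens: every unlock-before-return disappears.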
2024-08-07  x86/ioapic: Cleanup structs  (Thomas Gleixner; 1 file, -18/+14)

Make them conform to the TIP coding style guide.

Signed-off-by: Thomas Gleixner <[email protected]>
Tested-by: Qiuxu Zhuo <[email protected]>
Tested-by: Breno Leitao <[email protected]>
Link: https://lore.kernel.org/all/[email protected]
2024-08-07  x86/ioapic: Mark mp_alloc_timer_irq() __init  (Thomas Gleixner; 1 file, -2/+2)

Only invoked from check_timer(), which is __init too. Cleanup the variable declaration while at it.

Signed-off-by: Thomas Gleixner <[email protected]>
Tested-by: Qiuxu Zhuo <[email protected]>
Tested-by: Breno Leitao <[email protected]>
Reviewed-by: Breno Leitao <[email protected]>
Link: https://lore.kernel.org/all/[email protected]
2024-08-07  x86/ioapic: Handle allocation failures gracefully  (Thomas Gleixner; 1 file, -24/+22)

Breno observed panics when using failslab under certain conditions during runtime:

  can not alloc irq_pin_list (-1,0,20)
  Kernel panic - not syncing: IO-APIC: failed to add irq-pin. Can not proceed

  panic+0x4e9/0x590
  mp_irqdomain_alloc+0x9ab/0xa80
  irq_domain_alloc_irqs_locked+0x25d/0x8d0
  __irq_domain_alloc_irqs+0x80/0x110
  mp_map_pin_to_irq+0x645/0x890
  acpi_register_gsi_ioapic+0xe6/0x150
  hpet_open+0x313/0x480

That's a pointless panic which is a leftover of the historic IO/APIC code which panic'ed during early boot when the interrupt allocation failed.

The only place which might justify a panic is the PIT/HPET timer_check() code which tries to figure out whether the timer interrupt is delivered through the IO/APIC. But that code does not require handling interrupt allocation failures: if the interrupt cannot be allocated then timer delivery fails and it either panics due to that or falls back to legacy mode.

Cure this by removing the panic wrapper around __add_pin_to_irq_node() and making mp_irqdomain_alloc() aware of the failure condition, handling it gracefully like any other failure in this function.

Reported-by: Breno Leitao <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Tested-by: Breno Leitao <[email protected]>
Tested-by: Qiuxu Zhuo <[email protected]>
Link: https://lore.kernel.org/all/[email protected]
Link: https://lore.kernel.org/all/[email protected]
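A sketch of the error propagation this describes; the return convention and the cleanup call are assumptions, not copied from the patch:

    /* Inside mp_irqdomain_alloc() (sketch): instead of a panicking
     * wrapper, propagate the allocation failure up the irqdomain path. */
    ret = __add_pin_to_irq_node(data, node, ioapic, pin);
    if (ret) {
            /* roll back whatever was set up so far and fail gracefully */
            mp_irqdomain_free(domain, virq, nr_irqs);
            return ret;
    }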
2024-08-07  x86/mm: Fix PTI for i386 some more  (Thomas Gleixner; 1 file, -16/+29)

So it turns out that we have to do two passes of pti_clone_entry_text(): once before initcalls, such that device and late initcalls can use user-mode-helper / modprobe, and once after free_initmem() / mark_readonly().

Now obviously mark_readonly() can cause PMD splits, and pti_clone_pgtable() doesn't like that much.

Allow the late clone to split PMDs so that pagetables stay in sync.

[peterz: Changelog and comments]

Reported-by: Guenter Roeck <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Tested-by: Guenter Roeck <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]