path: root/arch/x86/kernel
Age | Commit message | Author | Files | Lines
2022-11-09KVM: SVM: move MSR_IA32_SPEC_CTRL save/restore to assemblyPaolo Bonzini1-10/+3
Restoration of the host IA32_SPEC_CTRL value is probably too late with respect to the return thunk training sequence. With respect to the user/kernel boundary, AMD says, "If software chooses to toggle STIBP (e.g., set STIBP on kernel entry, and clear it on kernel exit), software should set STIBP to 1 before executing the return thunk training sequence." I assume the same requirements apply to the guest/host boundary. The return thunk training sequence is in vmenter.S, quite close to the VM-exit. On hosts without V_SPEC_CTRL, however, the host's IA32_SPEC_CTRL value is not restored until much later. To avoid this, move the restoration of host SPEC_CTRL to assembly and, for consistency, move the restoration of the guest SPEC_CTRL as well. This is not particularly difficult, apart from some care to cover both 32- and 64-bit, and to share code between SEV-ES and normal vmentry. Cc: [email protected] Fixes: a149180fbcf3 ("x86: Add magic AMD return-thunk") Suggested-by: Jim Mattson <[email protected]> Reviewed-by: Sean Christopherson <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2022-11-09KVM: x86: use a separate asm-offsets.c filePaolo Bonzini1-6/+0
This already removes an ugly #include "" from asm-offsets.c, but especially it avoids a future error when trying to define asm-offsets for KVM's svm/svm.h header. This would not work for kernel/asm-offsets.c, because svm/svm.h includes kvm_cache_regs.h which is not in the include path when compiling asm-offsets.c. The problem is not there if the .c file is in arch/x86/kvm. Suggested-by: Sean Christopherson <[email protected]> Cc: [email protected] Fixes: a149180fbcf3 ("x86: Add magic AMD return-thunk") Reviewed-by: Sean Christopherson <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
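As a rough sketch of the approach (the file name and the emitted constant below are illustrative assumptions, not quoted from the patch), a dedicated offsets file living in arch/x86/kvm can include svm/svm.h directly and emit whatever vmenter.S needs:

	/* arch/x86/kvm/kvm-asm-offsets.c -- hypothetical sketch */
	#include <linux/kbuild.h>
	#include "svm/svm.h"

	static void __used common(void)
	{
		/* offsets consumable from svm/vmenter.S */
		OFFSET(SVM_vcpu_arch_regs, vcpu_svm, vcpu.arch.regs);
	}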
2022-11-09x86/fpu/xstate: Fix XSTATE_WARN_ON() to emit relevant diagnosticsAndrew Cooper1-6/+6
"XSAVE consistency problem" has been reported under Xen, but that's the extent of my divination skills. Modify XSTATE_WARN_ON() to force the caller to provide relevant diagnostic information, and modify each caller suitably. For check_xstate_against_struct(), this removes a double WARN() where one will do perfectly fine. CC stable as this has been wonky debugging for 7 years and it is good to have there too. Signed-off-by: Andrew Cooper <[email protected]> Signed-off-by: Borislav Petkov <[email protected]> Cc: <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2022-11-08x86/sgx: use VM_ACCESS_FLAGSKefeng Wang1-2/+2
Simplify VM_READ|VM_WRITE|VM_EXEC with VM_ACCESS_FLAGS. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Kefeng Wang <[email protected]> Cc: Jarkko Sakkinen <[email protected]> Cc: Dave Hansen <[email protected]> Cc: Alex Deucher <[email protected]> Cc: "Christian König" <[email protected]> Cc: Daniel Vetter <[email protected]> Cc: David Airlie <[email protected]> Cc: Dinh Nguyen <[email protected]> Cc: "Pan, Xinhui" <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2022-11-08x86/traps: avoid KMSAN bugs originating from handle_bug()Alexander Potapenko1-0/+7
There is a case in exc_invalid_op handler that is executed outside the irqentry_enter()/irqentry_exit() region when an UD2 instruction is used to encode a call to __warn(). In that case the `struct pt_regs` passed to the interrupt handler is never unpoisoned by KMSAN (this is normally done in irqentry_enter()), which leads to false positives inside handle_bug(). Use kmsan_unpoison_entry_regs() to explicitly unpoison those registers before using them. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Alexander Potapenko <[email protected]> Cc: Borislav Petkov <[email protected]> Cc: Dave Hansen <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Dmitry Vyukov <[email protected]> Cc: Kees Cook <[email protected]> Cc: Marco Elver <[email protected]> Cc: Masahiro Yamada <[email protected]> Cc: Nick Desaulniers <[email protected]> Cc: Peter Zijlstra (Intel) <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
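The shape of the fix, as a sketch rather than the verbatim hunk (the surrounding handle_bug() body is elided):

	static noinstr bool handle_bug(struct pt_regs *regs)
	{
		/*
		 * irqentry_enter() may not have run on this path, so the
		 * pt_regs KMSAN shadow can still be poisoned; clear it before
		 * the registers are inspected.
		 */
		kmsan_unpoison_entry_regs(regs);

		/* ... existing UD2 decoding and __warn() handling ... */
		return false;
	}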
2022-11-08x86: Fix misc small issuesJiapeng Chong2-3/+3
Fix:
 - ./arch/x86/kernel/traps.c: asm/proto.h is included more than once.
 - ./arch/x86/kernel/alternative.c:1610:2-3: Unneeded semicolon.
[ bp: Merge into a single patch. ] Reported-by: Abaci Robot <[email protected]> Signed-off-by: Jiapeng Chong <[email protected]> Signed-off-by: Borislav Petkov <[email protected]> Link: https://lore.kernel.org/r/1620902768-53822-1-git-send-email-jiapeng.chong@linux.alibaba.com Link: https://lore.kernel.org/r/[email protected]
2022-11-08x86/sgx: Add overflow check in sgx_validate_offset_length()Borys Popławski1-0/+3
sgx_validate_offset_length() function verifies "offset" and "length" arguments provided by userspace, but was missing an overflow check on their addition. Add it. Fixes: c6d26d370767 ("x86/sgx: Add SGX_IOC_ENCLAVE_ADD_PAGES") Signed-off-by: Borys Popławski <[email protected]> Signed-off-by: Borislav Petkov <[email protected]> Reviewed-by: Jarkko Sakkinen <[email protected]> Cc: [email protected] # v5.11+ Link: https://lore.kernel.org/r/[email protected]
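The essence of the fix is a wrap-around test on the unsigned sum; a sketch of the validation with the surrounding checks approximated:

	static int sgx_validate_offset_length(struct sgx_encl *encl,
					      unsigned long offset,
					      unsigned long length)
	{
		if (!IS_ALIGNED(offset, PAGE_SIZE) || !IS_ALIGNED(length, PAGE_SIZE))
			return -EINVAL;

		/* new: reject offset/length pairs whose sum overflows */
		if (offset + length < offset)
			return -EINVAL;

		if (offset + length > encl->size)
			return -EINVAL;

		return 0;
	}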
2022-11-04KVM/VMX: Allow exposing EDECCSSA user leaf function to KVM guestKai Huang2-0/+2
The new Asynchronous Exit (AEX) notification mechanism (AEX-notify) allows one enclave to receive a notification in the ERESUME after the enclave exit due to an AEX. EDECCSSA is a new SGX user leaf function (ENCLU[EDECCSSA]) to facilitate the AEX notification handling. The new EDECCSSA is enumerated via CPUID(EAX=0x12,ECX=0x0):EAX[11].

Besides allowing reporting the new AEX-notify attribute to KVM guests, also allow reporting the new EDECCSSA user leaf function to KVM guests so the guest can fully utilize the AEX-notify mechanism. Similar to the existing X86_FEATURE_SGX1 and X86_FEATURE_SGX2, introduce a new scattered X86_FEATURE_SGX_EDECCSSA bit for the new EDECCSSA, and report it in KVM's supported CPUIDs.

Note, no additional KVM enabling is required to allow the guest to use EDECCSSA. It's impossible to trap ENCLU (without completely preventing the guest from using SGX). Advertise EDECCSSA as supported purely so that userspace doesn't need to special case EDECCSSA, i.e. doesn't need to manually check host CPUID. The inability to trap ENCLU also means that KVM can't prevent the guest from using EDECCSSA, but that virtualization hole is benign as far as KVM is concerned. EDECCSSA is simply a fancy way to modify internal enclave state.

More background on how AEX-notify and EDECCSSA work: SGX maintains a Current State Save Area Frame (CSSA) for each enclave thread. When an AEX happens, the enclave thread context is saved to the CSSA and the CSSA is increased by 1. For a normal ERESUME which doesn't deliver an AEX notification, it restores the saved thread context from the previously saved SSA and decreases the CSSA. If AEX-notify is enabled for one enclave, the ERESUME acts differently. Instead of restoring the saved thread context and decreasing the CSSA, it acts like EENTER, which doesn't decrease the CSSA but establishes a clean-slate thread context using the CSSA for the enclave to handle the notification. After some handling, the enclave must discard the newly established SSA and switch back to the previously saved SSA (the one saved upon AEX). Otherwise, the enclave will run out of SSA space upon further AEXs and eventually fail to run. To solve this problem, the new EDECCSSA essentially decreases the CSSA. It can be used by the enclave notification handler to switch back to the previously saved SSA when needed, i.e. after it handles the notification.

Signed-off-by: Kai Huang <[email protected]> Signed-off-by: Dave Hansen <[email protected]> Acked-by: Sean Christopherson <[email protected]> Acked-by: Jarkko Sakkinen <[email protected]> Link: https://lore.kernel.org/all/20221101022422.858944-1-kai.huang%40intel.com
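For a scattered feature like this, the enumeration side amounts to a single table entry in arch/x86/kernel/cpu/scattered.c; a sketch following the existing cpuid_bits[] entry format (the exact placement is an assumption):

	/* { feature, CPUID register, bit, leaf, sub-leaf } */
	{ X86_FEATURE_SGX_EDECCSSA, CPUID_EAX, 11, 0x00000012, 0 },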
2022-11-04x86/sgx: Allow enclaves to use Asynchronous Exit NotificationDave Hansen1-1/+1
Short Version: Allow enclaves to use the new Asynchronous EXit (AEX) notification mechanism. This mechanism lets enclaves run a handler after an AEX event. These handlers can run mitigations for things like SGX-Step[1]. AEX Notify will be made available both on upcoming processors and on some older processors through microcode updates.

Long Version:

== SGX Attribute Background ==

The SGX architecture includes a list of SGX "attributes". These attributes ensure consistency and transparency around specific enclave features. As a simple example, the "DEBUG" attribute allows an enclave to be debugged, but also destroys virtually all of SGX security. Using attributes, enclaves can know that they are being debugged. Attributes also affect enclave attestation so an enclave can, for instance, be denied access to secrets while it is being debugged. The kernel keeps a list of known attributes and will only initialize enclaves that use a known set of attributes. This kernel policy eliminates the chance that a new SGX attribute could cause undesired effects. For example, imagine a new attribute was added called "PROVISIONKEY2" that provided similar functionality to "PROVISIONKEY". A kernel policy that allowed indiscriminate use of unknown attributes and thus PROVISIONKEY2 would undermine the existing kernel policy which limits use of PROVISIONKEY enclaves.

== AEX Notify Background ==

"Intel Architecture Instruction Set Extensions and Future Features - Version 45" is out[2]. There is a new chapter: Asynchronous Enclave Exit Notify and the EDECCSSA User Leaf Function. Enclave exits can be either synchronous and consensual (EEXIT for instance) or asynchronous (on an interrupt or fault). The asynchronous ones can evidently be exploited to single step enclaves[1], on top of which other naughty things can be built. AEX Notify will be made available both on upcoming processors and on some older processors through microcode updates.

== The Problem ==

These attacks are currently entirely opaque to the enclave since the hardware does the save/restore under the covers. The Asynchronous Enclave Exit Notify (AEX Notify) mechanism provides enclaves an ability to detect and mitigate potential exposure to these kinds of attacks.

== The Solution ==

Define the new attribute value for AEX Notification. Ensure the attribute is cleared from the list of reserved attributes. Instead of adding to the open-coded lists of individual attributes, add named lists of privileged (disallowed by default) and unprivileged (allowed by default) attributes. Add the AEX notify attribute as an unprivileged attribute, which will keep the kernel from rejecting enclaves with it set.

1. https://github.com/jovanbulck/sgx-step
2. https://cdrdv2.intel.com/v1/dl/getContent/671368?explicitVersion=true

Signed-off-by: Dave Hansen <[email protected]> Acked-by: Jarkko Sakkinen <[email protected]> Tested-by: Haitao Huang <[email protected]> Tested-by: Kai Huang <[email protected]> Link: https://lore.kernel.org/all/20220720191347.1343986-1-dave.hansen%40linux.intel.com
2022-11-03x86/intel_epb: Set Alder Lake N and Raptor Lake P normal EPBSrinivas Pandruvada1-1/+6
Intel processors support an additional software hint called EPB ("Energy Performance Bias") to guide the hardware heuristic of power management features to favor increasing dynamic performance or conserving energy consumption. Since this EPB hint is processor specific, the same hint value can result in different behavior across generations of processors.

commit 4ecc933b7d1f ("x86: intel_epb: Allow model specific normal EPB value") introduced the capability to update the default power-up EPB based on the CPU model and updated the default EPB to 7 for Alder Lake mobile CPUs. The same change is required for other Alder Lake-N and Raptor Lake-P mobile CPUs as the current default of 6 results in higher uncore power consumption. This increase in power is related to the memory clock frequency setting based on the EPB value. Depending on the EPB, the minimum memory frequency is set by the firmware. At EPB = 7, the minimum memory frequency is 1/4th compared to EPB = 6. This results in significant power saving for idle and semi-idle workloads on a Chrome platform. For example, the change in power and performance from an EPB change from 6 to 7 on Alder Lake-N:

  Workload       Performance diff (%)   Power diff
  --------------------------------------------------
  VP9 FHD30      0 (FPS)                -218 mw
  Google meet    0 (FPS)                -385 mw

This 200+ mw power saving is very significant for a mobile platform for battery life and thermal reasons. But as the workload demands more memory bandwidth, the memory frequency will be increased very fast. There are no power savings for such busy workloads. For example:

  Workload                   Performance diff (%) from EPB 6 to 7
  ------------------------------------------------------------------
  Speedometer 2.0            -0.8
  WebGL Aquarium 10K Fish    -0.5
  Unity 3D 2018               0.2
  WebXPRT3                   -0.5

There are run-to-run variations in performance scores for such busy workloads, so the difference is not significant.

Add a new define ENERGY_PERF_BIAS_NORMAL_POWERSAVE for EPB 7 and use it for Alder Lake-N and Raptor Lake-P mobile CPUs. This modification was done originally by Jeremy Compostella <[email protected]>.

Signed-off-by: Srinivas Pandruvada <[email protected]> Signed-off-by: Dave Hansen <[email protected]> Reviewed-by: Rafael J. Wysocki <[email protected]> Link: https://lore.kernel.org/all/20221027220056.1534264-1-srinivas.pandruvada%40linux.intel.com
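A sketch of what the change amounts to (the table and macro names follow the usual intel_epb.c / x86 match conventions but are written here as assumptions, not quoted from the patch):

	#define ENERGY_PERF_BIAS_NORMAL_POWERSAVE	7

	/* per-model power-up EPB defaults (sketch) */
	static const struct x86_cpu_id intel_epb_normal[] = {
		X86_MATCH_INTEL_FAM6_MODEL(ALDERLAKE_L,  ENERGY_PERF_BIAS_NORMAL_POWERSAVE),
		X86_MATCH_INTEL_FAM6_MODEL(ALDERLAKE_N,  ENERGY_PERF_BIAS_NORMAL_POWERSAVE),
		X86_MATCH_INTEL_FAM6_MODEL(RAPTORLAKE_P, ENERGY_PERF_BIAS_NORMAL_POWERSAVE),
		{}
	};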
2022-11-02x86/microcode: Drop struct ucode_cpu_info.validBorislav Petkov2-3/+2
It is not needed anymore. Signed-off-by: Borislav Petkov <[email protected]> Reviewed-by: Ashok Raj <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2022-11-02x86/microcode: Do some minor fixupsBorislav Petkov1-6/+5
Improve debugging printks and fixup formatting. Signed-off-by: Borislav Petkov <[email protected]> Reviewed-by: Ashok Raj <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2022-11-02x86/microcode: Kill refresh_fwBorislav Petkov3-6/+4
request_microcode_fw() can always request firmware now so drop this superfluous argument. Signed-off-by: Borislav Petkov <[email protected]> Reviewed-by: Ashok Raj <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2022-11-02x86/microcode: Simplify init path even moreBorislav Petkov1-104/+16
Get rid of all the IPI-sending functions and their wrappers and use those which are supposed to be called on each CPU. Thus:

 - microcode_init_cpu() gets called on each CPU on init, applying any new microcode that the driver might've found on the filesystem.

 - mc_cpu_starting() simply tries to apply cached microcode as this is the cpuhp starting callback which gets called on CPU resume too.

Even if the driver init function is a late initcall, there is no filesystem by then (not even a hdd driver has been loaded yet) so a new firmware load attempt cannot simply be done. It is pointless anyway - for that there's late loading if one really needs it.

Signed-off-by: Borislav Petkov <[email protected]> Reviewed-by: Ashok Raj <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2022-11-02x86/microcode: Rip out the subsys interface gunkBorislav Petkov1-58/+20
This is a left-over from the old days when CPU hotplug wasn't as robust as it is now. Currently, microcode gets loaded early on the CPU init path and there's no need to attempt to load it again, which that subsys interface callback is doing. The only other thing that the subsys interface init path was doing is adding the /sys/devices/system/cpu/cpu*/microcode/ hierarchy. So add a function which gets called on each CPU after all the necessary driver setup has happened. Use schedule_on_each_cpu() which can block because the sysfs creating code does kmem_cache_zalloc() which can block too and the initial version of this where it did that setup in an IPI handler of on_each_cpu() can cause a deadlock of the sort:

  lock(fs_reclaim);
  <Interrupt>
    lock(fs_reclaim);

as the IPI handler runs in IRQ context.

Signed-off-by: Borislav Petkov <[email protected]> Reviewed-by: Ashok Raj <[email protected]> Link: https://lore.kernel.org/r/[email protected]
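A sketch of the resulting pattern (the per-CPU helper name is hypothetical): the sysfs setup runs as a work item on every CPU, in a context that is allowed to sleep, rather than in an IPI handler:

	static void setup_online_cpu(struct work_struct *work)
	{
		int cpu = smp_processor_id();

		/* may sleep: sysfs/kobject creation allocates with GFP_KERNEL */
		mc_sysfs_add_cpu(cpu);		/* hypothetical helper */
	}

	/* called once the driver setup is complete: */
	schedule_on_each_cpu(setup_online_cpu);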
2022-11-01x86: Improve formatting of user_regset arraysRick Edgecombe1-42/+65
Back in 2018, Ingo Molnar suggested[0] to improve the formatting of the struct user_regset arrays. They have multiple member initializations per line and some lines exceed 100 chars. Reformat them like he suggested. [0] https://lore.kernel.org/lkml/[email protected]/ Signed-off-by: Rick Edgecombe <[email protected]> Signed-off-by: Dave Hansen <[email protected]> Link: https://lore.kernel.org/all/20221021221803.10910-3-rick.p.edgecombe%40intel.com
2022-11-01x86: Separate out x86_regset for 32 and 64 bitRick Edgecombe1-24/+43
In fill_thread_core_info() the ptrace accessible registers are collected for a core file to be written out as notes. The note array is allocated from a size calculated by iterating the user regset view, and counting the regsets that have a non-zero core_note_type. However, this only allows for there to be non-zero core_note_type at the end of the regset view. If there are any in the middle, fill_thread_core_info() will overflow the note allocation, as it iterates over the size of the view and the allocation would be smaller than that.

To apparently avoid this problem, x86_32_regsets and x86_64_regsets need to be constructed in a special way. They both draw their indices from a shared enum x86_regset, but 32 bit and 64 bit don't all support the same regsets and can be compiled in at the same time in the case of IA32_EMULATION. So this enum has to be laid out in a special way such that there are no gaps for both x86_32_regsets and x86_64_regsets. This involves ordering them just right by creating aliases for enums that are only in one view or the other, or creating multiple versions like REGSET32_IOPERM/REGSET64_IOPERM. So the collection of the registers tries to minimize the size of the allocation, but it doesn't quite work. Then the x86 ptrace side works around it by constructing the enum just right to avoid a problem. In the end there is no functional problem, but it is somewhat strange and fragile. It could also be improved like this [1], by better utilizing the smaller array, but this still wastes space in the regset arrays if they are not carefully crafted to avoid gaps.

Instead, just fully separate out the enums and give them separate 32 and 64 enum names. Add some bitsize-free defines for REGSET_GENERAL and REGSET_FP since they are the only two referred to in bitsize-generic code. While introducing a bunch of new 32/64 enums, change the pattern of the name from REGSET_FOO32 to REGSET32_FOO to better indicate that the 32 is in reference to the CPU mode and not the register size, as suggested by Eric Biederman.

This should have no functional change and is only changing how constants are generated and referred to.

[1] https://lore.kernel.org/lkml/[email protected]/

Signed-off-by: Rick Edgecombe <[email protected]> Signed-off-by: Dave Hansen <[email protected]> Link: https://lore.kernel.org/all/20221021221803.10910-2-rick.p.edgecombe%40intel.com
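A compressed sketch of the resulting layout (member lists abbreviated; the real enums carry more entries):

	enum x86_regset_32 {
		REGSET32_GENERAL,
		REGSET32_FP,
		REGSET32_XSTATE,
		REGSET32_TLS,
		REGSET32_IOPERM,
	};

	enum x86_regset_64 {
		REGSET64_GENERAL,
		REGSET64_FP,
		REGSET64_IOPERM,
		REGSET64_XSTATE,
	};

	#ifdef CONFIG_X86_64
	#define REGSET_GENERAL	REGSET64_GENERAL
	#define REGSET_FP	REGSET64_FP
	#else
	#define REGSET_GENERAL	REGSET32_GENERAL
	#define REGSET_FP	REGSET32_FP
	#endif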
2022-11-01x86/cfi: Add boot time hash randomizationPeter Zijlstra1-12/+108
In order to avoid known hashes (from knowing the boot image), randomize the CFI hashes with a per-boot random seed. Suggested-by: Kees Cook <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Reviewed-by: Kees Cook <[email protected]> Link: https://lore.kernel.org/r/[email protected]
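Conceptually every kCFI hash gets folded with a boot-time secret; a minimal sketch under the assumption of a simple XOR permutation (the function and variable names are made up for illustration):

	static u32 cfi_seed __ro_after_init;

	static u32 cfi_rehash(u32 hash)
	{
		return hash ^ cfi_seed;
	}

	/* during early boot, before call sites are patched: */
	cfi_seed = get_random_u32();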
2022-11-01x86/cfi: Boot time selection of CFI schemePeter Zijlstra1-18/+81
Add the "cfi=" boot parameter to allow people to select a CFI scheme at boot time. Mostly useful for development / debugging. Requested-by: Kees Cook <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Reviewed-by: Kees Cook <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2022-11-01x86/ibt: Implement FineIBTPeter Zijlstra4-14/+269
Implement an alternative CFI scheme that merges both the fine-grained nature of kCFI but also takes full advantage of the coarse grained hardware CFI as provided by IBT.

To contrast: kCFI is a pure software CFI scheme and relies on being able to read text -- specifically the instruction *before* the target symbol, and does the hash validation *before* doing the call (otherwise control flow is compromised already).

FineIBT is a software and hardware hybrid scheme; by ensuring every branch target starts with a hash validation it is possible to place the hash validation after the branch. This has several advantages:

 o the (hash) load is avoided; no memop; no RX requirement.

 o IBT WAIT-FOR-ENDBR state is a speculation stop; by placing the hash validation in the immediate instruction after the branch target there is a minimal speculation window and the whole is a viable defence against SpectreBHB.

 o Kees feels obliged to mention it is slightly more vulnerable when the attacker can write code.

Obviously this patch relies on kCFI, but additionally it also relies on the padding from the call-depth-tracking patches. It uses this padding to place the hash-validation while the call-sites are re-written to modify the indirect target to be 16 bytes in front of the original target, thus hitting this new preamble. Notably, there is no hardware that needs call-depth-tracking (Skylake) and supports IBT (Tigerlake and onwards).

Suggested-by: Joao Moreira (Intel) <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Reviewed-by: Kees Cook <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2022-10-31x86/sgx: Reduce delay and interference of enclave releaseReinette Chatre1-4/+19
commit 8795359e35bc ("x86/sgx: Silence softlockup detection when releasing large enclaves") introduced a cond_resched() during enclave release where the EREMOVE instruction is applied to every 4k enclave page. Giving other tasks an opportunity to run while tearing down a large enclave placates the soft lockup detector but Iqbal found that the fix causes a 25% performance degradation of a workload run using Gramine. Gramine maintains a 1:1 mapping between processes and SGX enclaves. That means if a workload in an enclave creates a subprocess then Gramine creates a duplicate enclave for that subprocess to run in. The consequence is that the release of the enclave used to run the subprocess can impact the performance of the workload that is run in the original enclave, especially in large enclaves when SGX2 is not in use. The workload run by Iqbal behaves as follows: Create enclave (enclave "A") /* Initialize workload in enclave "A" */ Create enclave (enclave "B") /* Run subprocess in enclave "B" and send result to enclave "A" */ Release enclave (enclave "B") /* Run workload in enclave "A" */ Release enclave (enclave "A") The performance impact of releasing enclave "B" in the above scenario is amplified when there is a lot of SGX memory and the enclave size matches the SGX memory. When there is 128GB SGX memory and an enclave size of 128GB, from the time enclave "B" starts the 128GB SGX memory is oversubscribed with a combined demand for 256GB from the two enclaves. Before commit 8795359e35bc ("x86/sgx: Silence softlockup detection when releasing large enclaves") enclave release was done in a tight loop without giving other tasks a chance to run. Even though the system experienced soft lockups the workload (run in enclave "A") obtained good performance numbers because when the workload started running there was no interference. Commit 8795359e35bc ("x86/sgx: Silence softlockup detection when releasing large enclaves") gave other tasks opportunity to run while an enclave is released. The impact of this in this scenario is that while enclave "B" is released and needing to access each page that belongs to it in order to run the SGX EREMOVE instruction on it, enclave "A" is attempting to run the workload needing to access the enclave pages that belong to it. This causes a lot of swapping due to the demand for the oversubscribed SGX memory. Longer latencies are experienced by the workload in enclave "A" while enclave "B" is released. Improve the performance of enclave release while still avoiding the soft lockup detector with two enhancements: - Only call cond_resched() after XA_CHECK_SCHED iterations. - Use the xarray advanced API to keep the xarray locked for XA_CHECK_SCHED iterations instead of locking and unlocking at every iteration. This batching solution is copied from sgx_encl_may_map() that also iterates through all enclave pages using this technique. With this enhancement the workload experiences a 5% performance degradation when compared to a kernel without commit 8795359e35bc ("x86/sgx: Silence softlockup detection when releasing large enclaves"), an improvement to the reported 25% degradation, while still placating the soft lockup detector. Scenarios with poor performance are still possible even with these enhancements. For example, short workloads creating sub processes while running in large enclaves. 
Further performance improvements are pursued in user space by avoiding the creation of duplicate enclaves for certain sub processes, and by using SGX2, which will do lazy allocation of pages as needed so enclaves created for sub processes start quickly and release quickly.

Fixes: 8795359e35bc ("x86/sgx: Silence softlockup detection when releasing large enclaves") Reported-by: Md Iqbal Hossain <[email protected]> Signed-off-by: Reinette Chatre <[email protected]> Signed-off-by: Dave Hansen <[email protected]> Tested-by: Md Iqbal Hossain <[email protected]> Link: https://lore.kernel.org/all/00efa80dd9e35dc85753e1c5edb0344ac07bb1f0.1667236485.git.reinette.chatre%40intel.com
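The batching pattern borrowed from sgx_encl_may_map() looks roughly like this (a sketch: the per-page teardown helper name is hypothetical, XA_CHECK_SCHED is the existing xarray batching constant):

	XA_STATE(xas, &encl->page_array, PFN_DOWN(encl->base));
	struct sgx_encl_page *entry;
	long cnt = 0;

	xas_lock(&xas);
	xas_for_each(&xas, entry, PFN_DOWN(encl->base + encl->size - 1)) {
		sgx_encl_free_page(entry);	/* hypothetical per-page teardown */

		if (!(++cnt % XA_CHECK_SCHED)) {
			/* drop the lock and let other tasks run for a bit */
			xas_pause(&xas);
			xas_unlock(&xas);
			cond_resched();
			xas_lock(&xas);
		}
	}
	xas_unlock(&xas);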
2022-10-31x86/espfix: Use get_random_long() rather than archrandomJason A. Donenfeld1-11/+1
A call is made to arch_get_random_longs() and rdtsc(), rather than just using get_random_long(), because this was written during a time when very early boot would give abysmal entropy. These days, a call to get_random_long() at early boot will incorporate RDRAND, RDTSC, and more, without having to do anything bespoke. In fact, the situation is now such that on the majority of x86 systems, the pool actually is initialized at this point, even though it doesn't need to be for get_random_long() to still return something better than what this function currently does. So simplify this to just call get_random_long() instead. Signed-off-by: Jason A. Donenfeld <[email protected]> Signed-off-by: Borislav Petkov <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2022-10-31x86/mce: Use severity table to handle uncorrected errors in kernelTony Luck1-3/+5
mce_severity_intel() has a special case to promote UC and AR errors in kernel context to PANIC severity. The "AR" case is already handled with separate entries in the severity table for all instruction fetch errors, and those data fetch errors that are not in a recoverable area of the kernel (i.e. have an extable fixup entry). Add an entry to the severity table for UC errors in kernel context that reports severity = PANIC. Delete the special case code from mce_severity_intel(). Signed-off-by: Tony Luck <[email protected]> Signed-off-by: Borislav Petkov <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2022-10-31x86/i8259: Make default_legacy_pic staticChen Lifu1-1/+1
The symbol is not used outside of the file, so mark it static. Signed-off-by: Chen Lifu <[email protected]> Signed-off-by: Borislav Petkov <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2022-10-27x86/MCE/AMD: Clear DFR errors found in THR handlerYazen Ghannam1-13/+20
AMD's MCA Thresholding feature counts errors of all severity levels, not just correctable errors. If a deferred error causes the threshold limit to be reached (it was the error that caused the overflow), then both a deferred error interrupt and a thresholding interrupt will be triggered. The order of the interrupts is not guaranteed. If the threshold interrupt handler is executed first, then it will clear MCA_STATUS for the error. It will not check or clear MCA_DESTAT which also holds a copy of the deferred error. When the deferred error interrupt handler runs it will not find an error in MCA_STATUS, but it will find the error in MCA_DESTAT. This will cause two errors to be logged. Check for deferred errors when handling a threshold interrupt. If a bank contains a deferred error, then clear the bank's MCA_DESTAT register. Define a new helper function to do the deferred error check and clearing of MCA_DESTAT. [ bp: Simplify, convert comment to passive voice. ] Fixes: 37d43acfd79f ("x86/mce/AMD: Redo error logging from APIC LVT interrupt handlers") Signed-off-by: Yazen Ghannam <[email protected]> Signed-off-by: Borislav Petkov <[email protected]> Cc: [email protected] Link: https://lore.kernel.org/r/[email protected]
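The check added to the thresholding interrupt path is conceptually the following (a sketch: the helper name is made up and the register/bit handling is approximated from memory):

	static void reset_deferred_error(unsigned int bank)
	{
		u64 status;

		rdmsrl(MSR_AMD64_SMCA_MCx_DESTAT(bank), status);
		if (!(status & MCI_STATUS_VAL) || !(status & MCI_STATUS_DEFERRED))
			return;

		/* log it here, then clear it so the DFR handler does not log it again */
		wrmsrl(MSR_AMD64_SMCA_MCx_DESTAT(bank), 0);
	}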
2022-10-25x86/retpoline: Fix crash printing warningDan Carpenter1-1/+1
The first argument of WARN() is a condition, so this will use "addr" as the format string and possibly crash. Fixes: 3b6c1747da48 ("x86/retpoline: Add SKL retthunk retpolines") Signed-off-by: Dan Carpenter <[email protected]> Signed-off-by: Dave Hansen <[email protected]> Link: https://lore.kernel.org/all/Y1gBoUZrRK5N%2FlCB@kili/
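Illustrated generically (this shows the bug class, not the literal call site): when the format-string slot receives a runtime pointer, printk tries to dereference it as a string:

	WARN("missing thunk: %ps\n", addr);	/* wrong: the literal becomes the condition,
						   addr becomes the format string */
	WARN(1, "missing thunk: %ps\n", addr);	/* correct: explicit condition, literal format */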
2022-10-24x86/resctrl: Remove arch_has_empty_bitmapsBabu Moger2-4/+1
The field arch_has_empty_bitmaps is not required anymore. The field min_cbm_bits is enough to determine, while validating the CBM (capacity bit mask), whether the architecture supports a zero CBM or not. Suggested-by: Reinette Chatre <[email protected]> Signed-off-by: Babu Moger <[email protected]> Signed-off-by: Borislav Petkov <[email protected]> Reviewed-by: Reinette Chatre <[email protected]> Reviewed-by: Fenghua Yu <[email protected]> Link: https://lore.kernel.org/r/166430979654.372014.615622285687642644.stgit@bmoger-ubuntu
2022-10-23Merge tag 'objtool_urgent_for_v6.1_rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tipLinus Torvalds1-1/+1
Pull objtool fix from Borislav Petkov:

 - Fix ORC stack unwinding when GCOV is enabled

* tag 'objtool_urgent_for_v6.1_rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/unwind/orc: Fix unreliable stack dump with gcov
2022-10-22Merge branch 'x86/urgent' into x86/core, to resolve conflictIngo Molnar6-49/+60
There's a conflict between the call-depth tracking commits in x86/core:

  ee3e2469b346 ("x86/ftrace: Make it call depth tracking aware")
  36b64f101219 ("x86/ftrace: Rebalance RSB")
  eac828eaef29 ("x86/ftrace: Remove ftrace_epilogue()")

And these fixes in x86/urgent:

  883bbbffa5a4 ("ftrace,kcfi: Separate ftrace_stub() and ftrace_stub_graph()")
  b5f1fc318440 ("x86/ftrace: Remove ftrace_epilogue()")

It's non-trivial overlapping modifications - resolve them.

Conflicts:
  arch/x86/kernel/ftrace_64.S

Signed-off-by: Ingo Molnar <[email protected]>
2022-10-21x86/fpu: Fix copy_xstate_to_uabi() to copy init states correctlyChang S. Bae1-0/+9
When an extended state component is not present in fpstate, but in init state, the function copies from init_fpstate via copy_feature(). But, dynamic states are not present in init_fpstate because of all-zeros init states. Then retrieving them from init_fpstate will explode like this:

  BUG: kernel NULL pointer dereference, address: 0000000000000000
  ...
  RIP: 0010:memcpy_erms+0x6/0x10
   ? __copy_xstate_to_uabi_buf+0x381/0x870
   fpu_copy_guest_fpstate_to_uabi+0x28/0x80
   kvm_arch_vcpu_ioctl+0x14c/0x1460 [kvm]
   ? __this_cpu_preempt_check+0x13/0x20
   ? vmx_vcpu_put+0x2e/0x260 [kvm_intel]
   kvm_vcpu_ioctl+0xea/0x6b0 [kvm]
   ? kvm_vcpu_ioctl+0xea/0x6b0 [kvm]
   ? __fget_light+0xd4/0x130
   __x64_sys_ioctl+0xe3/0x910
   ? debug_smp_processor_id+0x17/0x20
   ? fpregs_assert_state_consistent+0x27/0x50
   do_syscall_64+0x3f/0x90
   entry_SYSCALL_64_after_hwframe+0x63/0xcd

Adjust the 'mask' to zero out the userspace buffer for the features that are not available both from fpstate and from init_fpstate. The dynamic features depend on the compacted XSAVE format. Ensure it is enabled before reading XCOMP_BV in init_fpstate.

Fixes: 2308ee57d93d ("x86/fpu/amx: Enable the AMX feature in 64-bit mode") Reported-by: Yuan Yao <[email protected]> Suggested-by: Dave Hansen <[email protected]> Signed-off-by: Chang S. Bae <[email protected]> Signed-off-by: Dave Hansen <[email protected]> Tested-by: Yuan Yao <[email protected]> Link: https://lore.kernel.org/lkml/BYAPR11MB3717EDEF2351C958F2C86EED95259@BYAPR11MB3717.namprd11.prod.outlook.com/ Link: https://lkml.kernel.org/r/[email protected]
2022-10-21x86/unwind/orc: Fix unreliable stack dump with gcovChen Zhongjin1-1/+1
When a console stack dump is initiated with CONFIG_GCOV_PROFILE_ALL enabled, show_trace_log_lvl() gets out of sync with the ORC unwinder, causing the stack trace to show all text addresses as unreliable:

  # echo l > /proc/sysrq-trigger
  [ 477.521031] sysrq: Show backtrace of all active CPUs
  [ 477.523813] NMI backtrace for cpu 0
  [ 477.524492] CPU: 0 PID: 1021 Comm: bash Not tainted 6.0.0 #65
  [ 477.525295] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.0-1.fc36 04/01/2014
  [ 477.526439] Call Trace:
  [ 477.526854] <TASK>
  [ 477.527216] ? dump_stack_lvl+0xc7/0x114
  [ 477.527801] ? dump_stack+0x13/0x1f
  [ 477.528331] ? nmi_cpu_backtrace.cold+0xb5/0x10d
  [ 477.528998] ? lapic_can_unplug_cpu+0xa0/0xa0
  [ 477.529641] ? nmi_trigger_cpumask_backtrace+0x16a/0x1f0
  [ 477.530393] ? arch_trigger_cpumask_backtrace+0x1d/0x30
  [ 477.531136] ? sysrq_handle_showallcpus+0x1b/0x30
  [ 477.531818] ? __handle_sysrq.cold+0x4e/0x1ae
  [ 477.532451] ? write_sysrq_trigger+0x63/0x80
  [ 477.533080] ? proc_reg_write+0x92/0x110
  [ 477.533663] ? vfs_write+0x174/0x530
  [ 477.534265] ? handle_mm_fault+0x16f/0x500
  [ 477.534940] ? ksys_write+0x7b/0x170
  [ 477.535543] ? __x64_sys_write+0x1d/0x30
  [ 477.536191] ? do_syscall_64+0x6b/0x100
  [ 477.536809] ? entry_SYSCALL_64_after_hwframe+0x63/0xcd
  [ 477.537609] </TASK>

This happens when the compiled code for show_stack() has a single word on the stack, and doesn't use a tail call to show_stack_log_lvl(). (CONFIG_GCOV_PROFILE_ALL=y is the only known case of this.) Then the __unwind_start() skip logic hits an off-by-one bug and fails to unwind all the way to the intended starting frame.

Fix it by reverting the following commit:

  f1d9a2abff66 ("x86/unwind/orc: Don't skip the first frame for inactive tasks")

The original justification for that commit no longer exists. That original issue was later fixed in a different way, with the following commit:

  f2ac57a4c49d ("x86/unwind/orc: Fix inactive tasks with stack pointer in %sp on GCC 10 compiled kernels")

Fixes: f1d9a2abff66 ("x86/unwind/orc: Don't skip the first frame for inactive tasks") Signed-off-by: Chen Zhongjin <[email protected]> [jpoimboe: rewrite commit log] Signed-off-by: Josh Poimboeuf <[email protected]> Signed-off-by: Peter Zijlstra <[email protected]>
2022-10-20ftrace,kcfi: Separate ftrace_stub() and ftrace_stub_graph()Peter Zijlstra1-8/+9
Different function signatures mean they need to be different functions; otherwise CFI gets upset. As triggered by the ftrace boot tests:

  [] CFI failure at ftrace_return_to_handler+0xac/0x16c (target: ftrace_stub+0x0/0x14; expected type: 0x0a5d5347)

Fixes: 3c516f89e17e ("x86: Add support for CONFIG_CFI_CLANG") Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Reviewed-by: Mark Rutland <[email protected]> Tested-by: Mark Rutland <[email protected]> Link: https://lkml.kernel.org/r/[email protected]
2022-10-20x86/ftrace: Remove ftrace_epilogue()Peter Zijlstra1-15/+6
Remove the weird jumps to RET and simply use RET. This then promotes ftrace_stub() to a real function; which becomes important for kcfi. Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Peter Zijlstra <[email protected]>
2022-10-20x86/mtrr: Remove unused cyrix_set_all() functionJuergen Gross1-34/+0
The Cyrix CPU specific MTRR function cyrix_set_all() will never be called as the mtrr_ops->set_all() callback will only be called in the use_intel() case, which would require the use_intel_if member of struct mtrr_ops to be set, which isn't the case for Cyrix. Signed-off-by: Juergen Gross <[email protected]> Signed-off-by: Borislav Petkov <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2022-10-19x86/mtrr: Add comment for set_mtrr_state() serializationJuergen Gross1-1/+4
Add a comment about set_mtrr_state() needing serialization. [ bp: Touchups. ] Suggested-by: Borislav Petkov <[email protected]> Signed-off-by: Juergen Gross <[email protected]> Signed-off-by: Borislav Petkov <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2022-10-19x86/signal/64: Move 64-bit signal code to its own fileBrian Gerst3-378/+385
[ bp: Fixup merge conflict caused by changes coming from the kbuild tree. ] Signed-off-by: Brian Gerst <[email protected]> Signed-off-by: Borislav Petkov <[email protected]> Acked-by: "Eric W. Biederman" <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Borislav Petkov <[email protected]>
2022-10-19x86/signal/32: Merge native and compat 32-bit signal codeBrian Gerst3-215/+387
There are significant differences between signal handling on 32-bit vs. 64-bit, like different structure layouts and legacy syscalls. Instead of duplicating that code for native and compat, merge both versions into one file. Signed-off-by: Brian Gerst <[email protected]> Signed-off-by: Borislav Petkov <[email protected]> Acked-by: "Eric W. Biederman" <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Borislav Petkov <[email protected]>
2022-10-19x86/signal: Add ABI prefixes to frame setup functionsBrian Gerst1-11/+7
Add ABI prefixes to the frame setup functions that didn't already have them. To avoid compiler warnings and prepare for moving these functions to separate files, make them non-static. Signed-off-by: Brian Gerst <[email protected]> Signed-off-by: Borislav Petkov <[email protected]> Acked-by: "Eric W. Biederman" <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Borislav Petkov <[email protected]>
2022-10-19x86/signal: Merge get_sigframe()Brian Gerst1-42/+38
Adapt the native get_sigframe() function so that the compat signal code can use it. Signed-off-by: Brian Gerst <[email protected]> Signed-off-by: Borislav Petkov <[email protected]> Acked-by: "Eric W. Biederman" <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Borislav Petkov <[email protected]>
2022-10-19x86/signal: Remove sigset_t parameter from frame setup functionsBrian Gerst1-16/+12
Push down the call to sigmask_to_save() into the frame setup functions. Thus, remove the use of compat_sigset_t outside of the compat code. Signed-off-by: Brian Gerst <[email protected]> Signed-off-by: Borislav Petkov <[email protected]> Acked-by: "Eric W. Biederman" <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Borislav Petkov <[email protected]>
2022-10-19x86/signal: Remove sig parameter from frame setup functionsBrian Gerst1-12/+11
Passing the signal number as a separate parameter is unnecessary, since it is always ksig->sig. Signed-off-by: Brian Gerst <[email protected]> Signed-off-by: Borislav Petkov <[email protected]> Acked-by: "Eric W. Biederman" <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Borislav Petkov <[email protected]>
2022-10-18x86/resctrl: Fix min_cbm_bits for AMDBabu Moger1-6/+2
AMD systems support zero CBM (capacity bit mask) for cache allocation. That is reflected in rdt_init_res_defs_amd() by:

  r->cache.arch_has_empty_bitmaps = true;

However given the unified code in cbm_validate(), checking for:

  val == 0 && !arch_has_empty_bitmaps

is not enough because of another check in cbm_validate():

  if ((zero_bit - first_bit) < r->cache.min_cbm_bits)

The default value of r->cache.min_cbm_bits = 1. Leading to:

  $ cd /sys/fs/resctrl
  $ mkdir foo
  $ cd foo
  $ echo L3:0=0 > schemata
  -bash: echo: write error: Invalid argument
  $ cat /sys/fs/resctrl/info/last_cmd_status
  Need at least 1 bits in the mask

Initialize the min_cbm_bits to 0 for AMD. Also, remove the default setting of min_cbm_bits and initialize it separately.

After the fix:

  $ cd /sys/fs/resctrl
  $ mkdir foo
  $ cd foo
  $ echo L3:0=0 > schemata
  $ cat /sys/fs/resctrl/info/last_cmd_status
  ok

Fixes: 316e7f901f5a ("x86/resctrl: Add struct rdt_cache::arch_has_{sparse, empty}_bitmaps") Co-developed-by: Stephane Eranian <[email protected]> Signed-off-by: Stephane Eranian <[email protected]> Signed-off-by: Babu Moger <[email protected]> Signed-off-by: Borislav Petkov <[email protected]> Reviewed-by: Ingo Molnar <[email protected]> Reviewed-by: James Morse <[email protected]> Reviewed-by: Reinette Chatre <[email protected]> Reviewed-by: Fenghua Yu <[email protected]> Cc: <[email protected]> Link: https://lore.kernel.org/lkml/[email protected]
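The per-vendor defaults end up looking roughly like this (a sketch of rdt_init_res_defs_amd()/rdt_init_res_defs_intel(); field names per the message above):

	/* rdt_init_res_defs_intel(): an empty CBM is not allowed */
	r->cache.min_cbm_bits = 1;

	/* rdt_init_res_defs_amd(): a zero CBM is architecturally valid */
	r->cache.min_cbm_bits = 0;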
2022-10-18x86/microcode/AMD: Apply the patch early on every logical threadBorislav Petkov1-3/+13
Currently, the patch application logic checks whether the revision needs to be applied on each logical CPU (SMT thread). Therefore, on SMT designs where the microcode engine is shared between the two threads, the application happens only on one of them as that is enough to update the shared microcode engine. However, there are microcode patches which do per-thread modification, see Link tag below.

Therefore, drop the revision check and try applying on each thread. This is what the BIOS does too so this method is very much tested.

Btw, change only the early paths. On the late loading paths, there's no point in doing per-thread modification because if it is some case like in the bugzilla below - removing a CPUID flag - the kernel cannot go and un-use features it has detected are there early. For that, one should use early loading anyway.

[ bp: Fixes does not contain the oldest commit which did check for equality but that is good enough. ] Fixes: 8801b3fcb574 ("x86/microcode/AMD: Rework container parsing") Reported-by: Ștefan Talpalaru <[email protected]> Signed-off-by: Borislav Petkov <[email protected]> Tested-by: Ștefan Talpalaru <[email protected]> Cc: <[email protected]> Link: https://bugzilla.kernel.org/show_bug.cgi?id=216211
2022-10-17x86/topology: Fix duplicated core ID within a packageZhang Rui1-1/+1
Today, core ID is assumed to be unique within each package. But an AlderLake-N platform adds a Module level between core and package, and Linux excludes the unknown Module bits from the core ID, resulting in duplicate core IDs. To keep the core ID unique within a package, Linux must include all APIC-ID bits for known or unknown levels above the core and below the package in the core ID. It is important to understand that core IDs have always come directly from the APIC-ID encoding, which comes from the BIOS. Thus there is no guarantee that they start at 0, or that they are contiguous. As such, naively using them for array indexes can be problematic. [ dhansen: un-known -> unknown ] Fixes: 7745f03eb395 ("x86/topology: Add CPUID.1F multi-die/package support") Suggested-by: Len Brown <[email protected]> Signed-off-by: Zhang Rui <[email protected]> Signed-off-by: Dave Hansen <[email protected]> Reviewed-by: Len Brown <[email protected]> Cc: [email protected] Link: https://lkml.kernel.org/r/[email protected]
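A sketch of the core-ID selection after the fix (the mask math and variable names are approximations, not the exact diff): the select mask spans everything from the SMT width up to the package width, so unknown intermediate levels such as Module stay part of the core ID:

	core_select_mask = (~(-1 << pkg_mask_width)) >> smt_mask_width;
	c->cpu_core_id	 = apic->phys_pkg_id(c->initial_apicid, smt_mask_width) &
			   core_select_mask;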
2022-10-17x86/topology: Fix multiple packages shown on a single-package systemZhang Rui1-4/+10
CPUID.1F/B does not enumerate Package level explicitly, instead, all the APIC-ID bits above the enumerated levels are assumed to be package ID bits. Current code gets package ID by shifting out all the APIC-ID bits that Linux supports, rather than shifting out all the APIC-ID bits that CPUID.1F enumerates. This introduces problems when CPUID.1F enumerates a level that Linux does not support. For example, on a single package AlderLake-N, there are 2 Ecore Modules with 4 atom cores in each module. Linux does not support the Module level and interprets the Module ID bits as package ID and erroneously reports a multi module system as a multi-package system. Fix this by using APIC-ID bits above all the CPUID.1F enumerated levels as package ID. [ dhansen: spelling fix ] Fixes: 7745f03eb395 ("x86/topology: Add CPUID.1F multi-die/package support") Suggested-by: Len Brown <[email protected]> Signed-off-by: Zhang Rui <[email protected]> Signed-off-by: Dave Hansen <[email protected]> Reviewed-by: Len Brown <[email protected]> Cc: [email protected] Link: https://lkml.kernel.org/r/[email protected]
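In the same approximate terms, the package ID is now derived from every APIC-ID bit above the widest level CPUID.1F enumerates, whether or not Linux models that level:

	/* pkg_mask_width covers ALL enumerated levels, known to Linux or not */
	c->phys_proc_id = apic->phys_pkg_id(c->initial_apicid, pkg_mask_width);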
2022-10-17x86/bugs: Add retbleed=forcePeter Zijlstra (Intel)1-0/+2
Debug aid, allows running retbleed=force,stuff on non-affected uarchs Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
2022-10-17x86/retbleed: Add call depth tracking mitigationThomas Gleixner1-2/+30
The fully secure mitigation for RSB underflow on Intel SKL CPUs is IBRS, which inflicts up to 30% penalty for pathological syscall heavy work loads. Software based call depth tracking and RSB refill is not perfect, but reduces the attack surface massively. The penalty for the pathological case is about 8% which is still annoying but definitely more palatable than IBRS. Add a retbleed=stuff command line option to enable the call depth tracking and software refill of the RSB. This gives admins a choice. IBeeRS are safe and cause headaches, call depth tracking is considered to be s(t)ufficiently safe. Signed-off-by: Thomas Gleixner <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2022-10-17x86/ftrace: Make it call depth tracking awarePeter Zijlstra3-7/+33
Since ftrace has trampolines, don't use thunks for the __fentry__ site but instead require that every function called from there includes accounting. This very much includes all the direct-call functions. Additionally, ftrace uses ROP tricks in two places: - return_to_handler(), and - ftrace_regs_caller() when pt_regs->orig_ax is set by a direct-call. return_to_handler() already uses a retpoline to replace an indirect-jump to defeat IBT, since this is a jump-type retpoline, make sure there is no accounting done and ALTERNATIVE the RET into a ret. ftrace_regs_caller() does much the same and gets the same treatment. Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2022-10-17x86/ftrace: Rebalance RSBPeter Zijlstra1-0/+11
ftrace_regs_caller() uses a PUSH;RET pattern to tail-call into a direct-call function, this unbalances the RSB, fix that. Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2022-10-17x86/ftrace: Remove ftrace_epilogue()Peter Zijlstra1-15/+6
Remove the weird jumps to RET and simply use RET. This then promotes ftrace_stub() to a real function; which becomes important for kcfi. Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Link: https://lore.kernel.org/r/[email protected]