path: root/arch/x86/kernel
Age  Commit message  (Author, files changed, lines removed/added)
2020-05-28  Merge tag 'v5.7-rc7' into perf/core, to pick up fixes  (Ingo Molnar, 6 files, -72/+175)
Signed-off-by: Ingo Molnar <[email protected]>
2020-05-27  copy_xstate_to_kernel(): don't leave parts of destination uninitialized  (Al Viro, 1 file, -38/+48)
Copy the corresponding pieces of init_fpstate into the gaps instead. Cc: [email protected] Tested-by: Alexander Potapenko <[email protected]> Acked-by: Borislav Petkov <[email protected]> Signed-off-by: Al Viro <[email protected]>
2020-05-27  x86/apb_timer: Drop unused TSC calibration  (Johan Hovold, 1 file, -53/+0)
Drop the APB-timer TSC calibration, which hasn't been used since the removal of Moorestown support by commit 1a8359e411eb ("x86/mid: Remove Intel Moorestown"). Signed-off-by: Johan Hovold <[email protected]> Signed-off-by: Borislav Petkov <[email protected]> Acked-by: Thomas Gleixner <[email protected]> Link: https://lkml.kernel.org/r/[email protected]
2020-05-26  x86/io_apic: Remove unused function mp_init_irq_at_boot()  (YueHaibing, 1 file, -13/+0)
There are no callers in-tree anymore since ef9e56d894ea ("x86/ioapic: Remove obsolete post hotplug update") so remove it. Signed-off-by: YueHaibing <[email protected]> Signed-off-by: Borislav Petkov <[email protected]> Link: https://lkml.kernel.org/r/[email protected]
2020-05-26  x86/apic: Make TSC deadline timer detection message visible  (Borislav Petkov, 1 file, -1/+1)
The commit c84cb3735fd5 ("x86/apic: Move TSC deadline timer debug printk") removed the message which said that the deadline timer was enabled. It added a pr_debug() message which is issued when deadline timer validation succeeds, but that message appears only when CONFIG_DYNAMIC_DEBUG is enabled; otherwise, pr_debug() calls get optimized away if DEBUG is not defined in the compilation unit. Therefore, make the above message pr_info() so that it is visible in dmesg. Signed-off-by: Borislav Petkov <[email protected]> Link: https://lkml.kernel.org/r/[email protected]
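The change itself is a one-liner. A sketch of the before and after (the exact message text is an assumption, not copied from the patch):

    /* before: compiled out unless DEBUG or CONFIG_DYNAMIC_DEBUG is active */
    pr_debug("TSC deadline timer available\n");

    /* after: always emitted to dmesg at KERN_INFO level */
    pr_info("TSC deadline timer available\n");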
2020-05-25  x86/reboot/quirks: Add MacBook6,1 reboot quirk  (Hill Ma, 1 file, -0/+8)
On the MacBook6,1, reboot hangs unless the reboot=pci parameter is added. Make that quirk automatic. Signed-off-by: Hill Ma <[email protected]> Signed-off-by: Borislav Petkov <[email protected]> Cc: [email protected] Link: https://lkml.kernel.org/r/[email protected]
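Reboot quirks like this are usually expressed as an entry in the DMI-based reboot_dmi_table. A minimal sketch of what such an entry looks like (the match strings here are assumptions, not copied from the patch):

    {       /* Handle reboot hang on the MacBook6,1 */
            .callback = set_pci_reboot,
            .ident = "Apple MacBook6,1",
            .matches = {
                    DMI_MATCH(DMI_SYS_VENDOR, "Apple Inc."),
                    DMI_MATCH(DMI_PRODUCT_NAME, "MacBook6,1"),
            },
    },

The set_pci_reboot() callback selects the same reboot method that reboot=pci would have forced from the command line.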
2020-05-24  Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net  (David S. Miller, 2 files, -7/+24)
The MSCC bug fix in 'net' had to be adjusted slightly because the register accesses are done differently in net-next. Signed-off-by: David S. Miller <[email protected]>
2020-05-23  x86/apic/uv: Remove code for unused distributed GRU mode  (Steve Wahl, 1 file, -58/+1)
Distributed GRU mode appeared in only one generation of UV hardware, no version of the BIOS has ever shipped with this feature enabled, and there are no plans to change that. The gru.s3.mode check has always been and will continue to be false, so remove this dead code. Signed-off-by: Steve Wahl <[email protected]> Signed-off-by: Borislav Petkov <[email protected]> Acked-by: Dimitri Sivanich <[email protected]> Link: https://lkml.kernel.org/r/20200513221123.GJ3240@raspberrypi
2020-05-22  x86/unwind/orc: Fix unwind_get_return_address_ptr() for inactive tasks  (Josh Poimboeuf, 1 file, -0/+7)
Normally, show_trace_log_lvl() scans the stack, looking for text addresses to print. In parallel, it unwinds the stack with unwind_next_frame(). If the stack address matches the pointer returned by unwind_get_return_address_ptr() for the current frame, the text address is printed normally without a question mark. Otherwise it's considered a breadcrumb (potentially from a previous call path) and it's printed with a question mark to indicate that the address is unreliable and typically can be ignored.

Since the following commit:

  f1d9a2abff66 ("x86/unwind/orc: Don't skip the first frame for inactive tasks")

... for inactive tasks, show_trace_log_lvl() prints *only* unreliable addresses (prepended with '?'). That happens because, for the first frame of an inactive task, unwind_get_return_address_ptr() returns the wrong return address pointer: one word *below* the task stack pointer. show_trace_log_lvl() starts scanning at the stack pointer itself, so it never finds the first 'reliable' address, causing only guesses to be printed.

The first frame of an inactive task isn't a normal stack frame. It's actually just an instance of 'struct inactive_task_frame' which is left behind by __switch_to_asm(). Now that this inactive frame is actually exposed to callers, fix unwind_get_return_address_ptr() to interpret it properly.

Fixes: f1d9a2abff66 ("x86/unwind/orc: Don't skip the first frame for inactive tasks") Reported-by: Tetsuo Handa <[email protected]> Signed-off-by: Josh Poimboeuf <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Link: https://lkml.kernel.org/r/20200522135435.vbxs7umku5pyrdbk@treble
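For reference, the frame left behind by __switch_to_asm() has this layout on x86_64 (from asm/switch_to.h; shown here as context):

    struct inactive_task_frame {
            unsigned long flags;
            unsigned long r15;
            unsigned long r14;
            unsigned long r13;
            unsigned long r12;
            unsigned long bx;
            unsigned long bp;
            unsigned long ret_addr;
    };

The return address therefore lives at a fixed offset inside this frame, and the unwinder has to return a pointer to ret_addr rather than assuming a normal frame shape.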
2020-05-22  x86/amd_nb: Add AMD family 17h model 60h PCI IDs  (Alexander Monakov, 1 file, -0/+5)
Add PCI IDs for AMD Renoir (4000-series Ryzen CPUs). This is necessary to enable support for temperature sensors via the k10temp module. Signed-off-by: Alexander Monakov <[email protected]> Signed-off-by: Borislav Petkov <[email protected]> Acked-by: Yazen Ghannam <[email protected]> Acked-by: Guenter Roeck <[email protected]> Link: https://lkml.kernel.org/r/[email protected]
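The shape of such a change in amd_nb.c is the usual define-plus-table-entry pattern. A sketch with placeholder values (the actual device IDs come from the patch, not from here):

    /* illustrative placeholder ID, not the real one */
    #define PCI_DEVICE_ID_AMD_17H_M60H_DF_F3  0x144b

    static const struct pci_device_id amd_nb_misc_ids[] = {
            /* ... existing entries ... */
            { PCI_DEVICE(PCI_VENDOR_ID_AMD, PCI_DEVICE_ID_AMD_17H_M60H_DF_F3) },
            {}
    };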
2020-05-21  x86/tsc: Add tsc_early_khz command line parameter  (Krzysztof Piecuch, 1 file, -1/+11)
Changing base clock frequency directly impacts TSC Hz but not the CPUID.16h value. An overclocked CPU supporting CPUID.16h and with partial CPUID.15h support will set TSC KHZ according to the "best guess" given by CPUID.16h, relying on tsc_refine_calibration_work to give better numbers later. tsc_refine_calibration_work will refuse to do its work when the outcome is off the early TSC KHZ value by more than 1%, which is certain to happen on an overclocked system.

Fix this by adding a tsc_early_khz command line parameter that makes the kernel skip early TSC calibration and use the given value instead. This allows the user to provide the expected TSC frequency that is closer to reality than the one reported by the hardware, enabling tsc_refine_calibration_work to do meaningful error checking.

[ tglx: Made the variable __initdata as it's only used on init and removed the error checking in the argument parser because kstrto*() only stores to the variable if the string is valid ]

Signed-off-by: Krzysztof Piecuch <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Link: https://lkml.kernel.org/r/O2CpIOrqLZHgNRkfjRpz_LGqnc1ix_seNIiOCvHY4RHoulOVRo6kMXKuLOfBVTi0SMMevg6Go1uZ_cL9fLYtYdTRNH78ChaFaZyG3VAyYz8=@protonmail.com
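The plumbing for such a parameter is small. A minimal sketch, assuming the variable is named after the option:

    static unsigned int __initdata tsc_early_khz;

    static int __init tsc_early_khz_setup(char *buf)
    {
            /* kstrto*() leaves the variable untouched on a bad string */
            return kstrtouint(buf, 0, &tsc_early_khz);
    }
    early_param("tsc_early_khz", tsc_early_khz_setup);

Early calibration is then bypassed whenever tsc_early_khz is non-zero, and the supplied value is used as the initial TSC frequency.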
2020-05-20  x86/gpu: add RKL stolen memory support  (Matt Roper, 1 file, -0/+1)
RKL re-uses the same stolen memory registers as TGL and ICL.

Bspec: 52055
Bspec: 49589
Bspec: 49636

Cc: Lucas De Marchi <[email protected]> Signed-off-by: Matt Roper <[email protected]> Reviewed-by: Anusha Srivatsa <[email protected]> Acked-by: Borislav Petkov <[email protected]> Signed-off-by: Lucas De Marchi <[email protected]> Link: https://patchwork.freedesktop.org/patch/msgid/[email protected]
2020-05-19  x86/audit: Fix a -Wmissing-prototypes warning for ia32_classify_syscall()  (Benjamin Thiel, 1 file, -1/+1)
Lift the prototype of ia32_classify_syscall() into its own header. Signed-off-by: Benjamin Thiel <[email protected]> Signed-off-by: Borislav Petkov <[email protected]> Link: https://lkml.kernel.org/r/[email protected]
2020-05-19  x86/kvm: Restrict ASYNC_PF to user space  (Thomas Gleixner, 1 file, -93/+7)
The async page fault injection into kernel space creates more problems than it solves. The host has absolutely no knowledge about the state of the guest if the fault happens in CPL0. The only restriction for the host is interrupt disabled state. If interrupts are enabled in the guest then the exception can hit arbitrary code. The HALT based wait in non-preemptible code is a hacky replacement for a proper hypercall.

For the ongoing work to restrict instrumentation and make the RCU idle interaction well defined, the required extra work for supporting async pagefault in CPL0 is just not justified and creates complexity for a dubious benefit.

The CPL3 injection is well defined and does not cause any issues as it is more or less the same as a regular page fault from CPL3.

Suggested-by: Andy Lutomirski <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Reviewed-by: Alexandre Chartre <[email protected]> Acked-by: Paolo Bonzini <[email protected]> Acked-by: Peter Zijlstra <[email protected]> Link: https://lkml.kernel.org/r/[email protected]
2020-05-19  x86/kvm: Sanitize kvm_async_pf_task_wait()  (Thomas Gleixner, 1 file, -60/+141)
While working on the entry consolidation I stumbled over the KVM async page fault handler and kvm_async_pf_task_wait() in particular. It took me a while to realize that the randomly sprinkled around rcu_irq_enter()/exit() invocations are just cargo cult programming. Several patches "fixed" RCU splats by curing the symptoms without noticing that the code is flawed from a design perspective.

The main problem is that this async injection is not based on a proper handshake mechanism and only respects the minimal requirement, i.e. the guest is not in a state where it has interrupts disabled. Aside from that, the actual code is a convoluted, one-size-fits-all Swiss army knife. It is invoked from different places with different RCU constraints:

  1) Host side:

     vcpu_enter_guest()
       kvm_x86_ops->handle_exit()
         kvm_handle_page_fault()
           kvm_async_pf_task_wait()

     The invocation happens from fully preemptible context.

  2) Guest side. The async page fault interrupted:

     a) user space

     b) preemptible kernel code which is not in a RCU read side critical section

     c) non-preemptible kernel code or a RCU read side critical section or kernel code with CONFIG_PREEMPTION=n, which allows not to differentiate between #2b and #2c.

RCU is watching for:

  #1  The vCPU exited and current is definitely not the idle task.

  #2a The #PF entry code on the guest went through enter_from_user_mode() which reactivates RCU.

  #2b There is no preemptible, interrupts enabled code in the kernel which can run with RCU looking away. (The idle task is always non-preemptible.)

I.e. all schedulable states (#1, #2a, #2b) do not need any of this RCU voodoo at all. In #2c RCU is eventually not watching, but as that state cannot schedule anyway there is no point to worry about it, so it has to invoke rcu_irq_enter() before running that code. This can be optimized, but this will be done as an extra step in course of the entry code consolidation work.

So the proper solution for this is to:

  - Split kvm_async_pf_task_wait() into schedule and halt based waiting interfaces which share the enqueueing code.

  - Add comments (condensed form of this changelog) to spare others the time waste and pain of reverse engineering all of this with the help of incomprehensible changelogs and code history.

  - Invoke kvm_async_pf_task_wait_schedule() from kvm_handle_page_fault(), user mode and schedulable kernel side async page faults (#1, #2a, #2b).

  - Invoke kvm_async_pf_task_wait_halt() for the non schedulable kernel case (#2c). For this case also remove the rcu_irq_exit()/enter() pair around the halt as it is just a pointless exercise:

      - vCPUs can VMEXIT at any random point and can be scheduled out for an arbitrary amount of time by the host, and this is not any different except that it voluntarily triggers the exit via halt.

      - The interrupted context could have RCU watching already. So the rcu_irq_exit() before the halt is not gaining anything aside from confusing the reader. Claiming that this might prevent RCU stalls is just an illusion.

Signed-off-by: Thomas Gleixner <[email protected]> Reviewed-by: Alexandre Chartre <[email protected]> Acked-by: Paolo Bonzini <[email protected]> Acked-by: Peter Zijlstra <[email protected]> Link: https://lkml.kernel.org/r/[email protected]
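A sketch of the halt-based half of the split, following the description above (the node type and exact loop details are assumptions):

    static void kvm_async_pf_task_wait_halt(struct kvm_task_sleep_node *n)
    {
            for (;;) {
                    if (hlist_unhashed(&n->link))
                            break;  /* woken by the PAGE_READY notification */
                    /*
                     * Deliberately no rcu_irq_exit()/enter() around the halt:
                     * the vCPU can be scheduled out at any random point anyway;
                     * halt merely triggers that exit voluntarily.
                     */
                    native_safe_halt();
                    local_irq_disable();
            }
    }

kvm_async_pf_task_wait_schedule() shares the enqueueing of the sleep node but waits by scheduling instead of halting.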
2020-05-19  x86/kvm: Handle async page faults directly through do_page_fault()  (Andy Lutomirski, 3 files, -21/+21)
KVM overloads #PF to indicate two types of not-actually-page-fault events. Right now, the KVM guest code intercepts them by modifying the IDT and hooking the #PF vector. This makes the already fragile fault code even harder to understand, and it also pollutes call traces with async_page_fault and do_async_page_fault for normal page faults.

Clean it up by moving the logic into do_page_fault() using a static branch. This gets rid of the platform trap_init override mechanism completely.

[ tglx: Fixed up 32bit, removed error code from the async functions and massaged coding style ]

Signed-off-by: Andy Lutomirski <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Reviewed-by: Alexandre Chartre <[email protected]> Acked-by: Paolo Bonzini <[email protected]> Acked-by: Peter Zijlstra <[email protected]> Link: https://lkml.kernel.org/r/[email protected]
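A sketch of the idea; the helper and key names here are assumptions based on the description, not copied from the patch:

    DEFINE_STATIC_KEY_FALSE(kvm_async_pf_enabled);

    dotraplinkage void do_page_fault(struct pt_regs *regs,
                                     unsigned long hw_error_code,
                                     unsigned long address)
    {
            /* intercept async-PF tokens before normal fault handling */
            if (static_branch_unlikely(&kvm_async_pf_enabled) &&
                kvm_handle_async_pf(regs, (u32)address))
                    return;

            /* ... regular #PF handling ... */
    }

The static key is enabled only when the guest actually arms async page faults, so the check is patched to a no-op everywhere else.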
2020-05-19  x86: Replace ist_enter() with nmi_enter()  (Peter Zijlstra, 4 files, -62/+24)
A few exceptions (like #DB and #BP) can happen at any location in the code. This means that tracers should treat events from these exceptions as NMI-like: the interrupted context could be holding locks with interrupts disabled, for instance. Similarly, #MC is an actual NMI-like exception.

All of them use ist_enter(), which only concerns itself with RCU but does not do any of the other setup that NMIs need. This means things like:

  printk()
    raw_spin_lock_irq(&logbuf_lock);
    <#DB/#BP/#MC>
      printk()
        raw_spin_lock_irq(&logbuf_lock);

are entirely possible (well, not really, since printk tries hard to play nice, but the concept stands).

So replace ist_enter() with nmi_enter(). Also observe that any nmi_enter() caller must be both notrace and NOKPROBE, or in the noinstr text section.

Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Reviewed-by: Alexandre Chartre <[email protected]> Link: https://lkml.kernel.org/r/[email protected]
2020-05-19  x86/mce: Send #MC signal from task work  (Peter Zijlstra, 1 file, -25/+31)
Convert #MC over to using task_work_add(); it will run the same code slightly later, on the return-to-user path of the same exception. Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Reviewed-by: Frederic Weisbecker <[email protected]> Reviewed-by: Alexandre Chartre <[email protected]> Link: https://lkml.kernel.org/r/[email protected]
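A conceptual sketch of the conversion (the callback_head member and helper name are assumptions):

    static void kill_me_now(struct callback_head *cb)
    {
            force_sig(SIGBUS);
    }

    /* in the #MC handler, instead of delivering the signal in place: */
    current->mce_kill_me.func = kill_me_now;
    task_work_add(current, &current->mce_kill_me, true);

task_work_add() queues the callback to run on the return to user space, where taking the signal is safe.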
2020-05-19  x86/entry: Get rid of ist_begin/end_non_atomic()  (Thomas Gleixner, 2 files, -39/+4)
This is completely overengineered and definitely not an interface which should be made available to anything other than this particular MCE case. Signed-off-by: Thomas Gleixner <[email protected]> Reviewed-by: Alexandre Chartre <[email protected]> Acked-by: Peter Zijlstra <[email protected]> Link: https://lkml.kernel.org/r/[email protected]
2020-05-18  Merge tag 'v5.7-rc6' into objtool/core, to pick up fixes and resolve semantic conflict  (Ingo Molnar, 6 files, -72/+168)
Resolve structural conflict between:

  59566b0b622e: ("x86/ftrace: Have ftrace trampolines turn read-only at the end of system boot up")

which introduced a new reference to 'ftrace_epilogue', and:

  0298739b7983: ("x86,ftrace: Fix ftrace_regs_caller() unwind")

which renamed it to 'ftrace_caller_end'. Rename the new usage site in the merge commit.

Signed-off-by: Ingo Molnar <[email protected]>
2020-05-17  Merge tag 'objtool-urgent-2020-05-17' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip  (Linus Torvalds, 1 file, -7/+9)
Pull x86 stack unwinding fix from Thomas Gleixner: "A single bugfix for the ORC unwinder to ensure that the error flag which tells the unwinding code whether a stack trace can be trusted or not is always set correctly. This was messed up by a couple of changes in the recent past."

* tag 'objtool-urgent-2020-05-17' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/unwind/orc: Fix error handling in __unwind_start()
2020-05-17  Merge tag 'x86_urgent_for_v5.7-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip  (Linus Torvalds, 1 file, -0/+8)
Pull x86 fix from Borislav Petkov: "A single fix for early boot crashes of kernels built with gcc10 and stack protector enabled."

* tag 'x86_urgent_for_v5.7-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86: Fix early boot crash on gcc-10, third try
2020-05-16  x86/fpu/xstate: Restore supervisor states for signal return  (Yu-cheng Yu, 1 file, -5/+39)
The signal return fast path directly restores user states from the user buffer. Once that succeeds, restore supervisor states (but only when they are not yet restored).

For the slow path, save supervisor states to preserve them across context switches, and restore after the user states are restored. The previous version has the overhead of an XSAVES in both the fast and the slow paths. It is addressed as follows:

  - In the fast path, only do an XRSTORS.
  - In the slow path, do a supervisor-state-only XSAVES, and relocate the buffer contents.

Some thoughts on the implementation:

  - In the slow path, can any supervisor state become stale between save/restore?

    Answer: set_thread_flag(TIF_NEED_FPU_LOAD) protects the xstate buffer.

  - In the slow path, can any code reference a stale supervisor state register between save/restore?

    Answer: In the current lazy-restore scheme, any reference to xstate registers needs fpregs_lock()/fpregs_unlock() and __fpregs_load_activate().

  - Are there other options?

    One other option is eagerly restoring all supervisor states. Currently, CET user-mode states and ENQCMD's PASID do not need to be eagerly restored. The upcoming CET kernel-mode states (24 bytes) need to be eagerly restored. To me, eagerly restoring all supervisor states adds more overhead than benefit at this point.

Signed-off-by: Yu-cheng Yu <[email protected]> Signed-off-by: Borislav Petkov <[email protected]> Reviewed-by: Dave Hansen <[email protected]> Link: https://lkml.kernel.org/r/[email protected]
2020-05-16  x86/fpu/xstate: Preserve supervisor states for the slow path in __fpu__restore_sig()  (Yu-cheng Yu, 1 file, -25/+28)
The signal return code is responsible for taking an XSAVE buffer present in user memory and loading it into the hardware registers. This operation only affects user XSAVE state and never affects supervisor state.

The fast path through this code simply points XRSTOR directly at the user buffer. However, since user memory is not guaranteed to be always mapped, this XRSTOR can fail. If it fails, the signal return code falls back to a slow path which can tolerate page faults.

That slow path copies the xfeatures one by one out of the user buffer into the task's fpu state area. However, by being in a context where it can handle page faults, the code can also schedule. The lazy-fpu-load code would think it has an up-to-date fpstate and would fail to save the supervisor state when scheduling the task out. When scheduling back in, it would likely restore stale supervisor state. To fix that, preserve supervisor state before the slow path.

Modify copy_user_to_fpregs_zeroing() so that if it fails, fpregs are not zeroed, there is no need for fpregs_deactivate(), and supervisor states are preserved. Move set_thread_flag(TIF_NEED_FPU_LOAD) to the slow path. Without doing this, the fast path also needs supervisor states to be saved first.

Signed-off-by: Yu-cheng Yu <[email protected]> Signed-off-by: Borislav Petkov <[email protected]> Link: https://lkml.kernel.org/r/[email protected]
2020-05-16  x86/fpu: Introduce copy_supervisor_to_kernel()  (Yu-cheng Yu, 1 file, -0/+84)
The XSAVES instruction takes a mask and saves only the features specified in that mask. The kernel normally specifies that all features be saved. XSAVES also unconditionally uses the "compacted format" which means that all specified features are saved next to each other in memory. If a feature is removed from the mask, all the features after it will "move up" into earlier locations in the buffer. Introduce copy_supervisor_to_kernel(), which saves only supervisor states and then moves those states into the standard location where they are normally found. Signed-off-by: Yu-cheng Yu <[email protected]> Signed-off-by: Borislav Petkov <[email protected]> Link: https://lkml.kernel.org/r/[email protected]
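A conceptual sketch of the two steps, using the kernel's XSTATE_OP() helper (error handling and the per-feature offset computation are elided; offsets come from CPUID in the real code):

    int err;
    u64 mask = xfeatures_mask_supervisor();
    u32 lmask = mask, hmask = mask >> 32;

    /* 1) supervisor-only XSAVES: saves in compacted format */
    XSTATE_OP(XSAVES, xstate, lmask, hmask, err);

    /* 2) memmove() each saved feature from its compacted offset to its
     *    standard offset, so the usual accessors keep working */

Step 2 is needed precisely because of the compaction described above: with user features masked out, the supervisor states land at earlier offsets than their standard locations.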
2020-05-16  x86/nmi: Remove edac.h include leftover  (Borislav Petkov, 1 file, -4/+0)
... which db47d5f85646 ("x86/nmi, EDAC: Get rid of DRAM error reporting thru PCI SERR NMI") forgot to remove. No functional changes. Signed-off-by: Borislav Petkov <[email protected]> Link: https://lkml.kernel.org/r/[email protected]
2020-05-15  Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net  (David S. Miller, 5 files, -54/+121)
Move the bpf verifier trace check into the new switch statement in HEAD. Resolve the overlapping changes in hinic, where bug fixes overlap the addition of VF support. Signed-off-by: David S. Miller <[email protected]>
2020-05-15  x86: Fix early boot crash on gcc-10, third try  (Borislav Petkov, 1 file, -0/+8)
... or the odyssey of trying to disable the stack protector for the function which generates the stack canary value.

The whole story started with Sergei reporting a boot crash with a kernel built with gcc-10:

  Kernel panic - not syncing: stack-protector: Kernel stack is corrupted in: start_secondary
  CPU: 1 PID: 0 Comm: swapper/1 Not tainted 5.6.0-rc5-00235-gfffb08b37df9 #139
  Hardware name: Gigabyte Technology Co., Ltd. To be filled by O.E.M./H77M-D3H, BIOS F12 11/14/2013
  Call Trace:
    dump_stack
    panic
    ? start_secondary
    __stack_chk_fail
    start_secondary
    secondary_startup_64
  ---[ end Kernel panic - not syncing: stack-protector: Kernel stack is corrupted in: start_secondary

This happens because gcc-10 tail-call optimizes the last function call in start_secondary() - cpu_startup_entry() - and thus emits a stack canary check which fails because the canary value changes after the boot_init_stack_canary() call.

To fix that, the initial attempt was to mark the one function which generates the stack canary with:

  __attribute__((optimize("-fno-stack-protector"))) ... start_secondary(void *unused)

however, using the optimize attribute doesn't work cumulatively as the attribute does not add to but rather replaces previously supplied optimization options - roughly all -fxxx options. The key one among them being -fno-omit-frame-pointer, thus leading to a missing frame pointer - a frame pointer which the kernel needs.

The next attempt to prevent compilers from tail-call optimizing the last function call cpu_startup_entry(), shy of carving out start_secondary() into a separate compilation unit and building it with -fno-stack-protector, was to add an empty asm("").

This current solution was short and sweet, and reportedly is supported by both compilers, but we didn't get very far this time: future (LTO?) optimization passes could potentially eliminate it, which leads us to the third attempt: having an actual memory barrier there which the compiler cannot ignore or move around, etc.

That should hold for a long time, but hey, we said that about the other two solutions too, so...

Reported-by: Sergei Trofimovich <[email protected]> Signed-off-by: Borislav Petkov <[email protected]> Tested-by: Kalle Valo <[email protected]> Cc: <[email protected]> Link: https://lkml.kernel.org/r/[email protected]
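The third attempt, sketched (the macro name is the one this approach ended up with upstream, to the best of my knowledge):

    /* include/linux/compiler.h */
    #define prevent_tail_call_optimization()  mb()

    static void notrace start_secondary(void *unused)
    {
            /* ... */
            boot_init_stack_canary();
            /* ... */
            cpu_startup_entry(CPUHP_AP_ONLINE_IDLE);
            prevent_tail_call_optimization();
    }

Since cpu_startup_entry() never returns, keeping it a real call means the function epilogue, and with it the failing canary check, is never reached.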
2020-05-15  x86/unwind/orc: Fix error handling in __unwind_start()  (Josh Poimboeuf, 1 file, -7/+9)
The unwind_state 'error' field is used to inform the reliable unwinding code that the stack trace can't be trusted. Set this field for all errors in __unwind_start().

Also, move the zeroing out of the unwind_state struct to before the ORC table initialization check, to prevent the caller from reading uninitialized data if the ORC table is corrupted.

Fixes: af085d9084b4 ("stacktrace/x86: add function for detecting reliable stack traces")
Fixes: d3a09104018c ("x86/unwinder/orc: Dont bail on stack overflow")
Fixes: 98d0c8ebf77e ("x86/unwind/orc: Prevent unwinding before ORC initialization")
Reported-by: Pavel Machek <[email protected]> Signed-off-by: Josh Poimboeuf <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Link: https://lkml.kernel.org/r/d6ac7215a84ca92b895fdd2e1aa546729417e6e6.1589487277.git.jpoimboe@redhat.com
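A simplified sketch of the reordering (the real function handles several entry cases; treat this as illustrative):

    void __unwind_start(struct unwind_state *state, struct task_struct *task,
                        struct pt_regs *regs, unsigned long *first_frame)
    {
            memset(state, 0, sizeof(*state));  /* zero first, before any early out */
            state->task = task;

            if (!orc_init)
                    goto err;                  /* previously returned before the memset */

            /* ... normal setup ... */
            return;

    err:
            state->error = true;               /* every error path now sets this */
            state->stack_info.type = STACK_TYPE_UNKNOWN;
    }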
2020-05-14  Merge tag 'trace-v5.7-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace  (Linus Torvalds, 1 file, -1/+28)
Pull more tracing fixes from Steven Rostedt: "Various tracing fixes:

  - Fix a crash when having function tracing and function stack tracing on the command line. The ftrace trampolines are created as executable and read only. But the stack tracer tries to modify them with text_poke() which expects all kernel text to still be writable at boot. Keep the trampolines writable at boot, and convert them to read-only with the rest of the kernel.

  - A selftest was triggering in the ring buffer iterator code, that is no longer valid with the update of keeping the ring buffer writable while an iterator is reading. Just bail after three failed attempts to get an event and remove the warning and disabling of the ring buffer.

  - While modifying the ring buffer code, decided to remove all the unnecessary BUG() calls."

* tag 'trace-v5.7-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
  ring-buffer: Remove all BUG() calls
  ring-buffer: Don't deactivate the ring buffer on failed iterator reads
  x86/ftrace: Have ftrace trampolines turn read-only at the end of system boot up
2020-05-13  x86/fpu/xstate: Update sanitize_restored_xstate() for supervisor xstates  (Yu-cheng Yu, 1 file, -13/+24)
The function sanitize_restored_xstate() sanitizes user xstates of an XSAVE buffer by clearing bits not in the input 'xfeatures' from the buffer's header->xfeatures, effectively resetting those features back to the init state.

When supervisor xstates are introduced, it is necessary to make sure only user xstates are sanitized. Ensure supervisor bits in header->xfeatures stay set and supervisor states are not modified.

To make names clear, also:

  - Rename the function to sanitize_restored_user_xstate().
  - Rename input parameter 'xfeatures' to 'user_xfeatures'.
  - In __fpu__restore_sig(), rename 'xfeatures' to 'user_xfeatures'.

Signed-off-by: Yu-cheng Yu <[email protected]> Signed-off-by: Borislav Petkov <[email protected]> Reviewed-by: Dave Hansen <[email protected]> Link: https://lkml.kernel.org/r/[email protected]
2020-05-13  x86/fpu/xstate: Define new functions for clearing fpregs and xstates  (Fenghua Yu, 4 files, -22/+39)
Currently, fpu__clear() clears all fpregs and xstates. Once XSAVES supervisor states are introduced, supervisor settings (e.g. CET xstates) must remain active for signals. It is necessary to have separate functions (see the sketch below):

  - Create fpu__clear_user_states(): clear only user settings for signals;
  - Create fpu__clear_all(): clear both user and supervisor settings in flush_thread().

Also modify copy_init_fpstate_to_fpregs() to take a mask from the above two functions.

Remove the obvious side-comment in fpu__clear(), while at it.

[ bp: Make the second argument of fpu__clear() bool after requesting it a bunch of times during review. Add a comment about copy_init_fpstate_to_fpregs() locking needs. ]

Co-developed-by: Yu-cheng Yu <[email protected]> Signed-off-by: Fenghua Yu <[email protected]> Signed-off-by: Yu-cheng Yu <[email protected]> Signed-off-by: Borislav Petkov <[email protected]> Reviewed-by: Dave Hansen <[email protected]> Reviewed-by: Tony Luck <[email protected]> Link: https://lkml.kernel.org/r/[email protected]
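A sketch of the resulting split; this mirrors the two functions described above, assuming fpu__clear() keeps its current shape with an added bool:

    static void fpu__clear(struct fpu *fpu, bool user_only);

    void fpu__clear_user_states(struct fpu *fpu)
    {
            fpu__clear(fpu, true);   /* signals: supervisor state stays live */
    }

    void fpu__clear_all(struct fpu *fpu)
    {
            fpu__clear(fpu, false);  /* flush_thread(): wipe everything */
    }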
2020-05-13  x86/fpu/xstate: Introduce XSAVES supervisor states  (Yu-cheng Yu, 1 file, -9/+19)
Enable XSAVES supervisor states by setting MSR_IA32_XSS bits according to CPUID enumeration results. Also revise comments at various places. Co-developed-by: Fenghua Yu <[email protected]> Signed-off-by: Fenghua Yu <[email protected]> Signed-off-by: Yu-cheng Yu <[email protected]> Signed-off-by: Borislav Petkov <[email protected]> Reviewed-by: Dave Hansen <[email protected]> Reviewed-by: Tony Luck <[email protected]> Link: https://lkml.kernel.org/r/[email protected]
2020-05-13  x86/fpu/xstate: Separate user and supervisor xfeatures mask  (Yu-cheng Yu, 2 files, -35/+54)
Before the introduction of XSAVES supervisor states, 'xfeatures_mask' is used at various places to determine XSAVE buffer components and XCR0 bits. It contains only user xstates. To support supervisor xstates, it is necessary to separate user and supervisor xstates:

  - First, change 'xfeatures_mask' to 'xfeatures_mask_all', which represents the full set of bits that should ever be set in a kernel XSAVE buffer.

  - Introduce xfeatures_mask_supervisor() and xfeatures_mask_user() to extract relevant xfeatures from xfeatures_mask_all.

Co-developed-by: Fenghua Yu <[email protected]> Signed-off-by: Fenghua Yu <[email protected]> Signed-off-by: Yu-cheng Yu <[email protected]> Signed-off-by: Borislav Petkov <[email protected]> Reviewed-by: Dave Hansen <[email protected]> Reviewed-by: Tony Luck <[email protected]> Link: https://lkml.kernel.org/r/[email protected]
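A minimal sketch of the two helpers as described (the SUPPORTED masks come from the companion patch defining the new macros):

    static inline u64 xfeatures_mask_supervisor(void)
    {
            return xfeatures_mask_all & XFEATURE_MASK_SUPERVISOR_SUPPORTED;
    }

    static inline u64 xfeatures_mask_user(void)
    {
            return xfeatures_mask_all & XFEATURE_MASK_USER_SUPPORTED;
    }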
2020-05-12  x86/ftrace: Have ftrace trampolines turn read-only at the end of system boot up  (Steven Rostedt (VMware), 1 file, -1/+28)
Booting one of my machines, it triggered the following crash:

  Kernel/User page tables isolation: enabled
  ftrace: allocating 36577 entries in 143 pages
  Starting tracer 'function'
  BUG: unable to handle page fault for address: ffffffffa000005c
  #PF: supervisor write access in kernel mode
  #PF: error_code(0x0003) - permissions violation
  PGD 2014067 P4D 2014067 PUD 2015063 PMD 7b253067 PTE 7b252061
  Oops: 0003 [#1] PREEMPT SMP PTI
  CPU: 0 PID: 0 Comm: swapper Not tainted 5.4.0-test+ #24
  Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./To be filled by O.E.M., BIOS SDBLI944.86P 05/08/2007
  RIP: 0010:text_poke_early+0x4a/0x58
  Code: 34 24 48 89 54 24 08 e8 bf 72 0b 00 48 8b 34 24 48 8b 4c 24 08 84 c0 74 0b 48 89 df f3 a4 48 83 c4 10 5b c3 9c 58 fa 48 89 df <f3> a4 50 9d 48 83 c4 10 5b e9 d6 f9 ff ff 0 41 57 49
  RSP: 0000:ffffffff82003d38 EFLAGS: 00010046
  RAX: 0000000000000046 RBX: ffffffffa000005c RCX: 0000000000000005
  RDX: 0000000000000005 RSI: ffffffff825b9a90 RDI: ffffffffa000005c
  RBP: ffffffffa000005c R08: 0000000000000000 R09: ffffffff8206e6e0
  R10: ffff88807b01f4c0 R11: ffffffff8176c106 R12: ffffffff8206e6e0
  R13: ffffffff824f2440 R14: 0000000000000000 R15: ffffffff8206eac0
  FS: 0000000000000000(0000) GS:ffff88807d400000(0000) knlGS:0000000000000000
  CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  CR2: ffffffffa000005c CR3: 0000000002012000 CR4: 00000000000006b0
  Call Trace:
    text_poke_bp+0x27/0x64
    ? mutex_lock+0x36/0x5d
    arch_ftrace_update_trampoline+0x287/0x2d5
    ? ftrace_replace_code+0x14b/0x160
    ? ftrace_update_ftrace_func+0x65/0x6c
    __register_ftrace_function+0x6d/0x81
    ftrace_startup+0x23/0xc1
    register_ftrace_function+0x20/0x37
    func_set_flag+0x59/0x77
    __set_tracer_option.isra.19+0x20/0x3e
    trace_set_options+0xd6/0x13e
    apply_trace_boot_options+0x44/0x6d
    register_tracer+0x19e/0x1ac
    early_trace_init+0x21b/0x2c9
    start_kernel+0x241/0x518
    ? load_ucode_intel_bsp+0x21/0x52
    secondary_startup_64+0xa4/0xb0

I was able to trigger it on other machines when I added both "ftrace=function" and "trace_options=func_stack_trace" to the kernel command line.

The cause is that "ftrace=function" registers the function tracer and creates a trampoline, which is set as executable and read-only. Then "trace_options=func_stack_trace" updates the same trampoline to include the stack tracer version of the function tracer. But since the trampoline already exists, it is updated with text_poke_bp(). The problem is that when text_poke_bp() is called while system_state == SYSTEM_BOOTING, it simply does a memcpy() and not the page remapping, as it assumes that the text is still read-write. But in this case it is not, and we take a fault and crash.

Instead, let's keep the ftrace trampolines read-write during boot up, and then when the kernel executable text is set to read-only, the ftrace trampolines get set to read-only as well.

Link: https://lkml.kernel.org/r/[email protected] Cc: Ingo Molnar <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Andy Lutomirski <[email protected]> Cc: Borislav Petkov <[email protected]> Cc: Josh Poimboeuf <[email protected]> Cc: "H. Peter Anvin" <[email protected]> Cc: [email protected] Fixes: 768ae4406a5c ("x86/ftrace: Use text_poke()") Acked-by: Peter Zijlstra <[email protected]> Signed-off-by: Steven Rostedt (VMware) <[email protected]>
2020-05-12  x86/fpu/xstate: Define new macros for supervisor and user xstates  (Fenghua Yu, 2 files, -14/+15)
XCNTXT_MASK is 'all supported xfeatures' before introducing supervisor xstates. Rename it to XFEATURE_MASK_USER_SUPPORTED to make clear that these are user xstates.

Replace XFEATURE_MASK_SUPERVISOR with the following:

  - XFEATURE_MASK_SUPERVISOR_SUPPORTED: currently nothing. ENQCMD and Control-flow Enforcement Technology (CET) will be introduced in separate series.
  - XFEATURE_MASK_SUPERVISOR_UNSUPPORTED: currently only Processor Trace.
  - XFEATURE_MASK_SUPERVISOR_ALL: the combination of the above.

Co-developed-by: Yu-cheng Yu <[email protected]> Signed-off-by: Fenghua Yu <[email protected]> Signed-off-by: Yu-cheng Yu <[email protected]> Signed-off-by: Borislav Petkov <[email protected]> Reviewed-by: Dave Hansen <[email protected]> Reviewed-by: Tony Luck <[email protected]> Link: https://lkml.kernel.org/r/[email protected]
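The shape of the new supervisor macros, per the description above (a sketch; the user mask's feature list is not shown):

    #define XFEATURE_MASK_SUPERVISOR_SUPPORTED    (0)  /* nothing yet */
    #define XFEATURE_MASK_SUPERVISOR_UNSUPPORTED  (XFEATURE_MASK_PT)
    #define XFEATURE_MASK_SUPERVISOR_ALL \
            (XFEATURE_MASK_SUPERVISOR_SUPPORTED | \
             XFEATURE_MASK_SUPERVISOR_UNSUPPORTED)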
2020-05-12  x86/fpu/xstate: Rename validate_xstate_header() to validate_user_xstate_header()  (Fenghua Yu, 3 files, -5/+5)
The function validate_xstate_header() validates an xstate header coming from userspace (PTRACE or sigreturn). To make it clear, rename it to validate_user_xstate_header(). Suggested-by: Dave Hansen <[email protected]> Signed-off-by: Fenghua Yu <[email protected]> Signed-off-by: Yu-cheng Yu <[email protected]> Signed-off-by: Borislav Petkov <[email protected]> Reviewed-by: Dave Hansen <[email protected]> Reviewed-by: Tony Luck <[email protected]> Reviewed-by: Borislav Petkov <[email protected]> Link: https://lkml.kernel.org/r/[email protected]
2020-05-10  Merge tag 'x86-urgent-2020-05-10' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip  (Linus Torvalds, 4 files, -53/+93)
Pull x86 fixes from Thomas Gleixner: "A set of fixes for x86:

  - Ensure that the direct mapping alias is always flushed when changing page attributes. The optimization for small ranges failed to do so when the virtual address was in the vmalloc or module space.

  - Unbreak the trace event registration for syscalls without arguments, caused by the refactoring of the SYSCALL_DEFINE0() macro.

  - Move the printk in the TSC deadline timer code to a place where it is guaranteed to be called only once during boot and cannot be rearmed by clearing warn_once after boot. If it's invoked post boot then lockdep rightfully complains about a potential deadlock as the calling context is different.

  - A series of fixes for objtool and the ORC unwinder addressing a variety of small issues:

      - Stack offset tracking for indirect CFAs in objtool ignored subsequent pushes and pops.

      - Repair the unwind hints in the register clearing entry ASM code.

      - Make the unwinding in the low level exit to usermode code stop after switching to the trampoline stack. The unwind hint is no longer valid and the ORC unwinder emits a warning as it can't find the registers anymore.

      - Fix unwind hints in switch_to_asm() and rewind_stack_do_exit() which caused objtool to generate bogus ORC data.

      - Prevent unwinder warnings when dumping the stack of a non-current task, as there is no way to be sure about the validity because the dumped stack can be a moving target.

      - Make the ORC unwinder behave the same way as the frame pointer unwinder when dumping an inactive task's stack and do not skip the first frame.

      - Prevent ORC unwinding before ORC data has been initialized.

      - Immediately terminate unwinding when an unknown ORC entry type is found.

      - Prevent premature stop of the unwinder caused by IRET frames.

      - Fix another infinite loop in objtool caused by a negative offset which was not caught.

      - Address a few build warnings in the ORC unwinder and add missing static/ro_after_init annotations."

* tag 'x86-urgent-2020-05-10' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/unwind/orc: Move ORC sorting variables under !CONFIG_MODULES
  x86/apic: Move TSC deadline timer debug printk
  ftrace/x86: Fix trace event registration for syscalls without arguments
  x86/mm/cpa: Flush direct map alias during cpa
  objtool: Fix infinite loop in for_offset_range()
  x86/unwind/orc: Fix premature unwind stoppage due to IRET frames
  x86/unwind/orc: Fix error path for bad ORC entry type
  x86/unwind/orc: Prevent unwinding before ORC initialization
  x86/unwind/orc: Don't skip the first frame for inactive tasks
  x86/unwind: Prevent false warnings for non-current tasks
  x86/unwind/orc: Convert global variables to static
  x86/entry/64: Fix unwind hints in rewind_stack_do_exit()
  x86/entry/64: Fix unwind hints in __switch_to_asm()
  x86/entry/64: Fix unwind hints in kernel exit path
  x86/entry/64: Fix unwind hints in register clearing code
  objtool: Fix stack offset tracking for indirect CFAs
2020-05-08  x86/module: Use text_mutex in apply_relocate_add()  (Josh Poimboeuf, 1 file, -2/+7)
Now that the livepatch code no longer needs the text_mutex for changing module permissions, move its usage down to apply_relocate_add(). Note the s390 version of apply_relocate_add() doesn't need to use the text_mutex because it already uses s390_kernel_write_lock, which accomplishes the same task. Signed-off-by: Josh Poimboeuf <[email protected]> Acked-by: Joe Lawrence <[email protected]> Acked-by: Miroslav Benes <[email protected]> Signed-off-by: Jiri Kosina <[email protected]>
2020-05-08  x86/module: Use text_poke() for late relocations  (Peter Zijlstra, 1 file, -7/+31)
Because of late module patching, a livepatch module needs to be able to apply some of its relocations well after it has been loaded. Instead of playing games with module_{dis,en}able_ro(), use existing text poking mechanisms to apply relocations after module loading.

So far only x86, s390 and Power have HAVE_LIVEPATCH but only the first two also have STRICT_MODULE_RWX.

This will allow removal of the last module_disable_ro() usage in livepatch. The ultimate goal is to completely disallow making executable mappings writable.

[ jpoimboe: Split up patches. Use mod state to determine whether memcpy() can be used. Implement text_poke() for UML. ]

Cc: [email protected] Suggested-by: Josh Poimboeuf <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Signed-off-by: Josh Poimboeuf <[email protected]> Acked-by: Peter Zijlstra (Intel) <[email protected]> Acked-by: Joe Lawrence <[email protected]> Acked-by: Miroslav Benes <[email protected]> Signed-off-by: Jiri Kosina <[email protected]>
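A conceptual sketch of choosing the write primitive by module state (the helper name and structure are assumptions; the real patch threads this through apply_relocate_add()):

    static void write_reloc(struct module *me, void *loc, const void *val, int size)
    {
            bool early = me->state == MODULE_STATE_UNFORMED;

            if (early)
                    memcpy(loc, val, size);     /* load time: text still writable */
            else
                    text_poke(loc, val, size);  /* late: poke via temporary mapping */
    }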
2020-05-08  livepatch: Remove .klp.arch  (Peter Zijlstra, 2 files, -54/+0)
After the previous patch, vmlinux-specific KLP relocations are now applied early during KLP module load. This means that .klp.arch sections are no longer needed for *vmlinux-specific* KLP relocations.

One might think they're still needed for *module-specific* KLP relocations. If a to-be-patched module is loaded *after* its corresponding KLP module is loaded, any corresponding KLP relocations will be delayed until the to-be-patched module is loaded. If any special sections (.parainstructions, for example) rely on those relocations, their initializations (apply_paravirt) need to be done afterwards. Thus the apparent need for arch_klp_init_object_loaded() and its corresponding .klp.arch sections -- it allows some of the special section initializations to be done at a later time.

But... if you look closer, that dependency between the special sections and the module-specific KLP relocations doesn't actually exist in reality. Looking at the contents of the .altinstructions and .parainstructions sections, there's not a realistic scenario in which a KLP module's .altinstructions or .parainstructions section needs to access a symbol in a to-be-patched module. It might need to access a local symbol or even a vmlinux symbol; but not another module's symbol. When a special section needs to reference a local or vmlinux symbol, a normal rela can be used instead of a KLP rela.

Since the special section initializations don't actually have any real dependency on module-specific KLP relocations, .klp.arch and arch_klp_init_object_loaded() no longer have a reason to exist. So remove them.

As Peter said much more succinctly: So the reason for .klp.arch was that .klp.rela.* stuff would overwrite paravirt instructions. If that happens you're doing it wrong. Those RELAs are core kernel, not module, and thus should've happened in .rela.* sections at patch-module loading time.

Reverting this removes the two apply_{paravirt,alternatives}() calls from the late patching path, and means we don't have to worry about them when removing module_disable_ro().

[ jpoimboe: Rewrote patch description. Tweaked klp_init_object_loaded() error path. ]

Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Signed-off-by: Josh Poimboeuf <[email protected]> Acked-by: Peter Zijlstra (Intel) <[email protected]> Acked-by: Joe Lawrence <[email protected]> Acked-by: Miroslav Benes <[email protected]> Signed-off-by: Jiri Kosina <[email protected]>
2020-05-07  x86/cpu/amd: Make erratum #1054 a legacy erratum  (Kim Phillips, 1 file, -2/+1)
Commit 21b5ee59ef18 ("x86/cpu/amd: Enable the fixed Instructions Retired counter IRPERF") mistakenly added erratum #1054 as an OS Visible Workaround (OSVW) ID 0. Erratum #1054 is not OSVW ID 0 [1], so make it a legacy erratum. There would never have been a false positive on older hardware that has OSVW bit 0 set, since the IRPERF feature was not available. However, save a couple of RDMSR executions per thread, on modern system configurations that correctly set non-zero values in their OSVW_ID_Length MSRs. [1] Revision Guide for AMD Family 17h Models 00h-0Fh Processors. The revision guide is available from the bugzilla link below. Fixes: 21b5ee59ef18 ("x86/cpu/amd: Enable the fixed Instructions Retired counter IRPERF") Reported-by: Andrew Cooper <[email protected]> Signed-off-by: Kim Phillips <[email protected]> Signed-off-by: Borislav Petkov <[email protected]> Link: https://lkml.kernel.org/r/[email protected] Link: https://bugzilla.kernel.org/show_bug.cgi?id=206537
2020-05-07  x86/delay: Introduce TPAUSE delay  (Kyung Min Park, 1 file, -0/+3)
TPAUSE instructs the processor to enter an implementation-dependent optimized state. The instruction execution wakes up when the time-stamp counter reaches or exceeds the implicit EDX:EAX 64-bit input value. The instruction execution also wakes up due to the expiration of the operating system time-limit, or by an external interrupt or exceptions such as a debug exception or a machine check exception.

TPAUSE offers a choice of two lower-power states:

  1. Light-weight power/performance optimized state C0.1
  2. Improved power/performance optimized state C0.2

This way, it can save power with low wake-up latency in comparison to a spinloop based delay. The selection between the two is governed by the input register.

TPAUSE is available on processors with X86_FEATURE_WAITPKG.

Co-developed-by: Fenghua Yu <[email protected]> Signed-off-by: Fenghua Yu <[email protected]> Signed-off-by: Kyung Min Park <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Reviewed-by: Tony Luck <[email protected]> Link: https://lkml.kernel.org/r/[email protected]
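A sketch of a TPAUSE-based delay loop, using the raw opcode form since older assemblers lack the mnemonic (the function name is illustrative):

    static void delay_tpause(u64 cycles)
    {
            u64 until = rdtsc() + cycles;
            u32 eax = lower_32_bits(until);
            u32 edx = upper_32_bits(until);

            /* ecx = 0 selects the deeper C0.2 state, ecx = 1 selects C0.1 */
            asm volatile(".byte 0x66, 0x0f, 0xae, 0xf1"
                         :: "c" (0), "d" (edx), "a" (eax));
    }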
2020-05-07  x86/platform/uv: Unexport uv_apicid_hibits  (Christoph Hellwig, 1 file, -1/+0)
This variable is not used by modular code. Signed-off-by: Christoph Hellwig <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Link: https://lkml.kernel.org/r/[email protected]
2020-05-07  x86/platform/uv: Remove _uv_hub_info_check()  (Christoph Hellwig, 1 file, -6/+0)
Neither this function nor the helpers used to implement it are used anywhere in the kernel tree. Signed-off-by: Christoph Hellwig <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Not-acked-by: Dimitri Sivanich <[email protected]> Cc: Russ Anderson <[email protected]> Link: https://lkml.kernel.org/r/[email protected]
2020-05-07  x86/platform/uv: Simplify uv_send_IPI_one()  (Christoph Hellwig, 1 file, -5/+14)
Merge two helpers only used by uv_send_IPI_one() into the main function. Signed-off-by: Christoph Hellwig <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Not-acked-by: Dimitri Sivanich <[email protected]> Cc: Russ Anderson <[email protected]> Link: https://lkml.kernel.org/r/[email protected]
2020-05-07  x86/platform/uv: Mark uv_min_hub_revision_id static  (Christoph Hellwig, 1 file, -2/+1)
This variable is only used inside x2apic_uv_x.c and not even declared in a header. Signed-off-by: Christoph Hellwig <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Not-acked-by: Dimitri Sivanich <[email protected]> Cc: Russ Anderson <[email protected]> Link: https://lkml.kernel.org/r/[email protected]
2020-05-07  x86/platform/uv: Mark is_uv_hubless() static  (Christoph Hellwig, 1 file, -2/+1)
is_uv_hubless() is only used in x2apic_uv_x.c. Signed-off-by: Christoph Hellwig <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Not-acked-by: Dimitri Sivanich <[email protected]> Cc: Russ Anderson <[email protected]> Link: https://lkml.kernel.org/r/[email protected]
2020-05-07  cpu/hotplug: Remove disable_nonboot_cpus()  (Qais Yousef, 1 file, -2/+2)
The single user could have called freeze_secondary_cpus() directly. Since this function was a source of confusion, remove it as it's just a pointless wrapper.

While at it, rename enable_nonboot_cpus() to thaw_secondary_cpus() to preserve the naming symmetry. Done automatically via:

  git grep -l enable_nonboot_cpus | xargs sed -i 's/enable_nonboot_cpus/thaw_secondary_cpus/g'

Signed-off-by: Qais Yousef <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Cc: "Rafael J. Wysocki" <[email protected]> Link: https://lkml.kernel.org/r/[email protected]
2020-05-07  x86/apic: Convert the TSC deadline timer matching to steppings macro  (Borislav Petkov, 1 file, -45/+12)
... and get rid of the function pointers which would spit out the microcode revision based on the CPU stepping. Signed-off-by: Borislav Petkov <[email protected]> Reviewed-by: Mark Gross <mgross@linux.intel.com> Cc: Peter Zijlstra (Intel) <[email protected]> Link: https://lkml.kernel.org/r/[email protected]
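A sketch of what the resulting match table looks like with the steppings macro family (the models, steppings and revision values shown are placeholders, not the real table):

    static const struct x86_cpu_id deadline_match[] __initconst = {
            X86_MATCH_INTEL_FAM6_MODEL_STEPPINGS(HASWELL_X,
                                                 X86_STEPPINGS(0x2, 0x2), 0x3a),
            X86_MATCH_INTEL_FAM6_MODEL(HASWELL_L, 0x20),
            {},
    };

    /* lookup replaces the old per-stepping function pointers: */
    const struct x86_cpu_id *m = x86_match_cpu(deadline_match);
    if (m)
            rev = (u32)m->driver_data;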