aboutsummaryrefslogtreecommitdiff
path: root/arch/x86/kernel
AgeCommit message (Collapse)AuthorFilesLines
2020-08-03Merge tag 'irq-urgent-2020-08-02' of ↵Linus Torvalds2-0/+9
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull irq fixes from Ingo Molnar: "Fix a recent IRQ affinities regression, add in a missing debugfs printout that helps the debugging of IRQ affinity logic bugs, and fix a memory leak" * tag 'irq-urgent-2020-08-02' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: genirq/debugfs: Add missing irqchip flags genirq/affinity: Make affinity setting if activated opt-in irqdomain/treewide: Free firmware node after domain removal
2020-07-31Merge branch 'linus' into locking/core, to resolve conflictIngo Molnar5-18/+25
Conflicts: arch/arm/include/asm/percpu.h As Stephen Rothwell noted, there's a conflict between this commit in locking/core: a21ee6055c30 ("lockdep: Change hardirq{s_enabled,_context} to per-cpu variables") and this fresh upstream commit: aa54ea903abb ("ARM: percpu.h: fix build error") a21ee6055c30 is a simpler solution to the dependency problem and doesn't further increase header hell - so this conflict resolution effectively reverts aa54ea903abb and uses the a21ee6055c30 solution. Signed-off-by: Ingo Molnar <[email protected]>
2020-07-30initrd: remove support for multiple floppiesChristoph Hellwig1-2/+0
Remove the special handling for multiple floppies in the initrd code. No one should be using floppies for booting these days. (famous last words..) Includes a spelling fix from Colin Ian King <[email protected]>. Signed-off-by: Christoph Hellwig <[email protected]> Acked-by: Linus Torvalds <[email protected]>
2020-07-29x86/i8259: Use printk_deferred() to prevent deadlockThomas Gleixner1-1/+1
0day reported a possible circular locking dependency: Chain exists of: &irq_desc_lock_class --> console_owner --> &port_lock_key Possible unsafe locking scenario: CPU0 CPU1 ---- ---- lock(&port_lock_key); lock(console_owner); lock(&port_lock_key); lock(&irq_desc_lock_class); The reason for this is a printk() in the i8259 interrupt chip driver which is invoked with the irq descriptor lock held, which reverses the lock operations vs. printk() from arbitrary contexts. Switch the printk() to printk_deferred() to avoid that. Reported-by: kernel test robot <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Signed-off-by: Ingo Molnar <[email protected]> Cc: [email protected] Link: https://lore.kernel.org/r/[email protected]
2020-07-28Merge tag 'v5.8-rc7' into perf/core, to pick up fixesIngo Molnar15-68/+110
Signed-off-by: Ingo Molnar <[email protected]>
2020-07-27x86: switch to ->regset_get()Al Viro6-241/+74
All instances of ->get() in arch/x86 switched; that might or might not be worth splitting up. Notes: * for xstateregs_get() the amount we want to store is determined at the boot time; see init_xstate_size() and update_regset_xstate_info() for details. task->thread.fpu.state.xsave ends with a flexible array member and the amount of data in it depends upon the FPU features supported/enabled. * fpregs_get() writes slightly less than full ->thread.fpu.state.fsave (the last word is not copied); we pass the full size of state.fsave and let membuf_write() trim to the amount declared by regset - __regset_get() will make sure that the space in buffer is no more than that. * copy_xstate_to_user() and its helpers are gone now. * fpregs_soft_get() was getting user_regset_copyout() arguments wrong. Since "x86: x86 user_regset math_emu" back in 2008... I really doubt that it's worth splitting out for -stable, though - you need a 486SX box for that to trigger... [Kevin's braino fix for copy_xstate_to_kernel() essentially duplicated here] Signed-off-by: Al Viro <[email protected]>
2020-07-27genirq/affinity: Make affinity setting if activated opt-inThomas Gleixner1-0/+4
John reported that on a RK3288 system the perf per CPU interrupts are all affine to CPU0 and provided the analysis: "It looks like what happens is that because the interrupts are not per-CPU in the hardware, armpmu_request_irq() calls irq_force_affinity() while the interrupt is deactivated and then request_irq() with IRQF_PERCPU | IRQF_NOBALANCING. Now when irq_startup() runs with IRQ_STARTUP_NORMAL, it calls irq_setup_affinity() which returns early because IRQF_PERCPU and IRQF_NOBALANCING are set, leaving the interrupt on its original CPU." This was broken by the recent commit which blocked interrupt affinity setting in hardware before activation of the interrupt. While this works in general, it does not work for this particular case. As contrary to the initial analysis not all interrupt chip drivers implement an activate callback, the safe cure is to make the deferred interrupt affinity setting at activation time opt-in. Implement the necessary core logic and make the two irqchip implementations for which this is required opt-in. In hindsight this would have been the right thing to do, but ... Fixes: baedb87d1b53 ("genirq/affinity: Handle affinity setting on inactive interrupts correctly") Reported-by: John Keeping <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Tested-by: Marc Zyngier <[email protected]> Acked-by: Marc Zyngier <[email protected]> Cc: [email protected] Link: https://lkml.kernel.org/r/[email protected]
2020-07-27x86/cpu: Relocate sync_core() to sync_core.hRicardo Neri2-0/+2
Having sync_core() in processor.h is problematic since it is not possible to check for hardware capabilities via the *cpu_has() family of macros. The latter needs the definitions in processor.h. It also looks more intuitive to relocate the function to sync_core.h. This changeset does not make changes in functionality. Signed-off-by: Ricardo Neri <[email protected]> Signed-off-by: Ingo Molnar <[email protected]> Reviewed-by: Tony Luck <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2020-07-27Merge tag 'v5.8-rc7' into x86/cpu, to pick up fixesIngo Molnar5-18/+25
Signed-off-by: Ingo Molnar <[email protected]>
2020-07-26Merge branch 'x86/urgent' into x86/cleanupsIngo Molnar29-170/+202
Refresh the branch for a dependent commit. Signed-off-by: Ingo Molnar <[email protected]>
2020-07-26Merge branch 'locking/nmi' into x86/entryIngo Molnar2-16/+10
Resolve conflicts with ongoing lockdep work that fixed the NMI entry code. Conflicts: arch/x86/entry/common.c arch/x86/include/asm/idtentry.h Signed-off-by: Ingo Molnar <[email protected]>
2020-07-25Merge tag 'v5.8-rc6' into locking/core, to pick up fixesIngo Molnar6-46/+28
Signed-off-by: Ingo Molnar <[email protected]>
2020-07-25x86/split_lock: Enable the split lock feature on Sapphire Rapids and Alder ↵Fenghua Yu1-0/+2
Lake CPUs Add Sapphire Rapids and Alder Lake processors to CPU list to enumerate and enable the split lock feature. Signed-off-by: Fenghua Yu <[email protected]> Signed-off-by: Ingo Molnar <[email protected]> Reviewed-by: Tony Luck <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2020-07-25Merge tag 'v5.8-rc6' into x86/cpu, to refresh the branch before adding new ↵Ingo Molnar25-153/+178
commits Signed-off-by: Ingo Molnar <[email protected]>
2020-07-24x86/entry: Cleanup idtentry_enter/exitThomas Gleixner2-6/+6
Remove the temporary defines and fixup all references. Signed-off-by: Thomas Gleixner <[email protected]> Reviewed-by: Kees Cook <[email protected]> Link: https://lkml.kernel.org/r/[email protected]
2020-07-24x86/entry: Cleanup idtentry_entry/exit_userThomas Gleixner2-11/+11
Cleanup the temporary defines and use irqentry_ instead of idtentry_. Signed-off-by: Thomas Gleixner <[email protected]> Reviewed-by: Kees Cook <[email protected]> Link: https://lkml.kernel.org/r/[email protected]
2020-07-24x86/entry: Use generic syscall exit functionalityThomas Gleixner1-1/+2
Replace the x86 variant with the generic version. Provide the relevant architecture specific helper functions and defines. Use a temporary define for idtentry_exit_user which will be cleaned up seperately. Signed-off-by: Thomas Gleixner <[email protected]> Acked-by: Kees Cook <[email protected]> Link: https://lkml.kernel.org/r/[email protected]
2020-07-24Merge branch 'core/entry' into x86/entryThomas Gleixner5-46/+26
Pick up generic entry code to migrate x86 over.
2020-07-24x86: Correct noinstr qualifiersIra Weiny2-2/+2
The noinstr qualifier is to be specified before the return type in the same way inline is used. These 2 cases were missed by previous patches. Signed-off-by: Ira Weiny <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Reviewed-by: Tony Luck <[email protected]> Link: https://lkml.kernel.org/r/[email protected]
2020-07-24Merge v5.8-rc6 into drm-nextDave Airlie19-115/+148
I've got a silent conflict + two trees based on fixes to merge. Fixes a silent merge with amdgpu Signed-off-by: Dave Airlie <[email protected]>
2020-07-23irqdomain/treewide: Free firmware node after domain removalJon Derrick1-0/+5
Commit 711419e504eb ("irqdomain: Add the missing assignment of domain->fwnode for named fwnode") unintentionally caused a dangling pointer page fault issue on firmware nodes that were freed after IRQ domain allocation. Commit e3beca48a45b fixed that dangling pointer issue by only freeing the firmware node after an IRQ domain allocation failure. That fix no longer frees the firmware node immediately, but leaves the firmware node allocated after the domain is removed. The firmware node must be kept around through irq_domain_remove, but should be freed it afterwards. Add the missing free operations after domain removal where where appropriate. Fixes: e3beca48a45b ("irqdomain/treewide: Keep firmware node unconditionally allocated") Signed-off-by: Jon Derrick <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Reviewed-by: Andy Shevchenko <[email protected]> Acked-by: Bjorn Helgaas <[email protected]> # drivers/pci Cc: [email protected] Link: https://lkml.kernel.org/r/[email protected]
2020-07-22x86/dumpstack: Show registers dump with trace's log levelDmitry Safonov1-5/+5
show_trace_log_lvl() provides x86 platform-specific way to unwind backtrace with a given log level. Unfortunately, registers dump(s) are not printed with the same log level - instead, KERN_DEFAULT is always used. Arista's switches uses quite common setup with rsyslog, where only urgent messages goes to console (console_log_level=KERN_ERR), everything else goes into /var/log/ as the console baud-rate often is indecently slow (9600 bps). Backtrace dumps without registers printed have proven to be as useful as morning standups. Furthermore, in order to introduce KERN_UNSUPPRESSED (which I believe is still the most elegant way to fix raciness of sysrq[1]) the log level should be passed down the stack to register dumping functions. Besides, there is a potential use-case for printing traces with KERN_DEBUG level [2] (where registers dump shouldn't appear with higher log level). After all preparations are done, provide log_lvl parameter for show_regs_if_on_stack() and wire up to actual log level used as an argument for show_trace_log_lvl(). [1]: https://lore.kernel.org/lkml/[email protected]/ [2]: https://lore.kernel.org/linux-doc/[email protected]/ Signed-off-by: Dmitry Safonov <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Acked-by: Petr Mladek <[email protected]> Link: https://lkml.kernel.org/r/[email protected]
2020-07-22x86/dumpstack: Add log_lvl to __show_regs()Dmitry Safonov3-42/+47
show_trace_log_lvl() provides x86 platform-specific way to unwind backtrace with a given log level. Unfortunately, registers dump(s) are not printed with the same log level - instead, KERN_DEFAULT is always used. Arista's switches uses quite common setup with rsyslog, where only urgent messages goes to console (console_log_level=KERN_ERR), everything else goes into /var/log/ as the console baud-rate often is indecently slow (9600 bps). Backtrace dumps without registers printed have proven to be as useful as morning standups. Furthermore, in order to introduce KERN_UNSUPPRESSED (which I believe is still the most elegant way to fix raciness of sysrq[1]) the log level should be passed down the stack to register dumping functions. Besides, there is a potential use-case for printing traces with KERN_DEBUG level [2] (where registers dump shouldn't appear with higher log level). Add log_lvl parameter to __show_regs(). Keep the used log level intact to separate visible change. [1]: https://lore.kernel.org/lkml/[email protected]/ [2]: https://lore.kernel.org/linux-doc/[email protected]/ Signed-off-by: Dmitry Safonov <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Acked-by: Petr Mladek <[email protected]> Link: https://lkml.kernel.org/r/[email protected]
2020-07-22x86/dumpstack: Add log_lvl to show_iret_regs()Dmitry Safonov2-5/+5
show_trace_log_lvl() provides x86 platform-specific way to unwind backtrace with a given log level. Unfortunately, registers dump(s) are not printed with the same log level - instead, KERN_DEFAULT is always used. Arista's switches uses quite common setup with rsyslog, where only urgent messages goes to console (console_log_level=KERN_ERR), everything else goes into /var/log/ as the console baud-rate often is indecently slow (9600 bps). Backtrace dumps without registers printed have proven to be as useful as morning standups. Furthermore, in order to introduce KERN_UNSUPPRESSED (which I believe is still the most elegant way to fix raciness of sysrq[1]) the log level should be passed down the stack to register dumping functions. Besides, there is a potential use-case for printing traces with KERN_DEBUG level [2] (where registers dump shouldn't appear with higher log level). Add log_lvl parameter to show_iret_regs() as a preparation to add it to __show_regs() and show_regs_if_on_stack(). [1]: https://lore.kernel.org/lkml/[email protected]/ [2]: https://lore.kernel.org/linux-doc/[email protected]/ Signed-off-by: Dmitry Safonov <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Acked-by: Petr Mladek <[email protected]> Link: https://lkml.kernel.org/r/[email protected]
2020-07-22x86/dumpstack: Dump user space code correctly againThomas Gleixner1-10/+17
H.J. reported that post 5.7 a segfault of a user space task does not longer dump the Code bytes when /proc/sys/debug/exception-trace is enabled. It prints 'Code: Bad RIP value.' instead. This was broken by a recent change which made probe_kernel_read() reject non-kernel addresses. Update show_opcodes() so it retrieves user space opcodes via copy_from_user_nmi(). Fixes: 98a23609b103 ("maccess: always use strict semantics for probe_kernel_read") Reported-by: H.J. Lu <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Link: https://lkml.kernel.org/r/[email protected]
2020-07-22x86/stacktrace: Fix reliable check for empty user task stacksJosh Poimboeuf1-5/+0
If a user task's stack is empty, or if it only has user regs, ORC reports it as a reliable empty stack. But arch_stack_walk_reliable() incorrectly treats it as unreliable. That happens because the only success path for user tasks is inside the loop, which only iterates on non-empty stacks. Generally, a user task must end in a user regs frame, but an empty stack is an exception to that rule. Thanks to commit 71c95825289f ("x86/unwind/orc: Fix error handling in __unwind_start()"), unwind_start() now sets state->error appropriately. So now for both ORC and FP unwinders, unwind_done() and !unwind_error() always means the end of the stack was successfully reached. So the success path for kthreads is no longer needed -- it can also be used for empty user tasks. Reported-by: Wang ShaoBo <[email protected]> Signed-off-by: Josh Poimboeuf <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Tested-by: Wang ShaoBo <[email protected]> Link: https://lkml.kernel.org/r/f136a4e5f019219cbc4f4da33b30c2f44fa65b84.1594994374.git.jpoimboe@redhat.com
2020-07-22x86/unwind/orc: Fix ORC for newly forked tasksJosh Poimboeuf1-2/+6
The ORC unwinder fails to unwind newly forked tasks which haven't yet run on the CPU. It correctly reads the 'ret_from_fork' instruction pointer from the stack, but it incorrectly interprets that value as a call stack address rather than a "signal" one, so the address gets incorrectly decremented in the call to orc_find(), resulting in bad ORC data. Fix it by forcing 'ret_from_fork' frames to be signal frames. Reported-by: Wang ShaoBo <[email protected]> Signed-off-by: Josh Poimboeuf <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Tested-by: Wang ShaoBo <[email protected]> Link: https://lkml.kernel.org/r/f91a8778dde8aae7f71884b5df2b16d552040441.1594994374.git.jpoimboe@redhat.com
2020-07-22Merge branch 'sched/urgent'Peter Zijlstra6-46/+28
2020-07-22x86, vmlinux.lds: Page-align end of ..page_aligned sectionsJoerg Roedel1-0/+1
On x86-32 the idt_table with 256 entries needs only 2048 bytes. It is page-aligned, but the end of the .bss..page_aligned section is not guaranteed to be page-aligned. As a result, objects from other .bss sections may end up on the same 4k page as the idt_table, and will accidentially get mapped read-only during boot, causing unexpected page-faults when the kernel writes to them. This could be worked around by making the objects in the page aligned sections page sized, but that's wrong. Explicit sections which store only page aligned objects have an implicit guarantee that the object is alone in the page in which it is placed. That works for all objects except the last one. That's inconsistent. Enforcing page sized objects for these sections would wreckage memory sanitizers, because the object becomes artificially larger than it should be and out of bound access becomes legit. Align the end of the .bss..page_aligned and .data..page_aligned section on page-size so all objects places in these sections are guaranteed to have their own page. [ tglx: Amended changelog ] Signed-off-by: Joerg Roedel <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Reviewed-by: Kees Cook <[email protected]> Cc: [email protected] Link: https://lkml.kernel.org/r/[email protected]
2020-07-21exec: Implement kernel_execveEric W. Biederman1-1/+1
To allow the kernel not to play games with set_fs to call exec implement kernel_execve. The function kernel_execve takes pointers into kernel memory and copies the values pointed to onto the new userspace stack. The calls with arguments from kernel space of do_execve are replaced with calls to kernel_execve. The calls do_execve and do_execveat are made static as there are now no callers outside of exec. The comments that mention do_execve are updated to refer to kernel_execve or execve depending on the circumstances. In addition to correcting the comments, this makes it easy to grep for do_execve and verify it is not used. Inspired-by: https://lkml.kernel.org/r/[email protected] Reviewed-by: Kees Cook <[email protected]> Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: "Eric W. Biederman" <[email protected]>
2020-07-19copy_xstate_to_kernel: Fix typo which caused GDB regressionKevin Buettner1-1/+1
This fixes a regression encountered while running the gdb.base/corefile.exp test in GDB's test suite. In my testing, the typo prevented the sw_reserved field of struct fxregs_state from being output to the kernel XSAVES area. Thus the correct mask corresponding to XCR0 was not present in the core file for GDB to interrogate, resulting in the following behavior: [kev@f32-1 gdb]$ ./gdb -q testsuite/outputs/gdb.base/corefile/corefile testsuite/outputs/gdb.base/corefile/corefile.core Reading symbols from testsuite/outputs/gdb.base/corefile/corefile... [New LWP 232880] warning: Unexpected size of section `.reg-xstate/232880' in core file. With the typo fixed, the test works again as expected. Signed-off-by: Kevin Buettner <[email protected]> Fixes: 9e4636545933 ("copy_xstate_to_kernel(): don't leave parts of destination uninitialized") Cc: Al Viro <[email protected]> Cc: Dave Airlie <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2020-07-19Merge tag 'x86-urgent-2020-07-19' of ↵Linus Torvalds3-17/+6
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip into master Pull x86 fixes from Thomas Gleixner: "A pile of fixes for x86: - Fix the I/O bitmap invalidation on XEN PV, which was overlooked in the recent ioperm/iopl rework. This caused the TSS and XEN's I/O bitmap to get out of sync. - Use the proper vectors for HYPERV. - Make disabling of stack protector for the entry code work with GCC builds which enable stack protector by default. Removing the option is not sufficient, it needs an explicit -fno-stack-protector to shut it off. - Mark check_user_regs() noinstr as it is called from noinstr code. The missing annotation causes it to be placed in the text section which makes it instrumentable. - Add the missing interrupt disable in exc_alignment_check() - Fixup a XEN_PV build dependency in the 32bit entry code - A few fixes to make the Clang integrated assembler happy - Move EFI stub build to the right place for out of tree builds - Make prepare_exit_to_usermode() static. It's not longer called from ASM code" * tag 'x86-urgent-2020-07-19' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/boot: Don't add the EFI stub to targets x86/entry: Actually disable stack protector x86/ioperm: Fix io bitmap invalidation on Xen PV x86: math-emu: Fix up 'cmp' insn for clang ias x86/entry: Fix vectors to IDTENTRY_SYSVEC for CONFIG_HYPERV x86/entry: Add compatibility with IAS x86/entry/common: Make prepare_exit_to_usermode() static x86/entry: Mark check_user_regs() noinstr x86/traps: Disable interrupts in exc_aligment_check() x86/entry/32: Fix XEN_PV build dependency
2020-07-18x86/ioperm: Fix io bitmap invalidation on Xen PVAndy Lutomirski2-17/+4
tss_invalidate_io_bitmap() wasn't wired up properly through the pvop machinery, so the TSS and Xen's io bitmap would get out of sync whenever disabling a valid io bitmap. Add a new pvop for tss_invalidate_io_bitmap() to fix it. This is XSA-329. Fixes: 22fe5b0439dd ("x86/ioperm: Move TSS bitmap update to exit to user work") Signed-off-by: Andy Lutomirski <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Reviewed-by: Juergen Gross <[email protected]> Reviewed-by: Thomas Gleixner <[email protected]> Cc: [email protected] Link: https://lkml.kernel.org/r/d53075590e1f91c19f8af705059d3ff99424c020.1595030016.git.luto@kernel.org
2020-07-17genirq/affinity: Handle affinity setting on inactive interrupts correctlyThomas Gleixner1-17/+5
Setting interrupt affinity on inactive interrupts is inconsistent when hierarchical irq domains are enabled. The core code should just store the affinity and not call into the irq chip driver for inactive interrupts because the chip drivers may not be in a state to handle such requests. X86 has a hacky workaround for that but all other irq chips have not which causes problems e.g. on GIC V3 ITS. Instead of adding more ugly hacks all over the place, solve the problem in the core code. If the affinity is set on an inactive interrupt then: - Store it in the irq descriptors affinity mask - Update the effective affinity to reflect that so user space has a consistent view - Don't call into the irq chip driver This is the core equivalent of the X86 workaround and works correctly because the affinity setting is established in the irq chip when the interrupt is activated later on. Note, that this is only effective when hierarchical irq domains are enabled by the architecture. Doing it unconditionally would break legacy irq chip implementations. For hierarchial irq domains this works correctly as none of the drivers can have a dependency on affinity setting in inactive state by design. Remove the X86 workaround as it is not longer required. Fixes: 02edee152d6e ("x86/apic/vector: Ignore set_affinity call for inactive interrupts") Reported-by: Ali Saidi <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Tested-by: Ali Saidi <[email protected]> Cc: [email protected] Link: https://lore.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected]
2020-07-17x86/efi: Remove references to no-longer-used efi_have_uv1_memmap()[email protected]1-9/+0
In removing UV1 support, efi_have_uv1_memmap is no longer used. Signed-off-by: Steve Wahl <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Reviewed-by: Ard Biesheuvel <[email protected]> Link: https://lkml.kernel.org/r/[email protected]
2020-07-17x86/platform/uv: Remove support for UV1 platform from x2apic_uv_x[email protected]1-96/+26
UV1 is not longer supported. Signed-off-by: Steve Wahl <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Link: https://lkml.kernel.org/r/[email protected]
2020-07-16treewide: Remove uninitialized_var() usageKees Cook1-5/+5
Using uninitialized_var() is dangerous as it papers over real bugs[1] (or can in the future), and suppresses unrelated compiler warnings (e.g. "unused variable"). If the compiler thinks it is uninitialized, either simply initialize the variable or make compiler changes. In preparation for removing[2] the[3] macro[4], remove all remaining needless uses with the following script: git grep '\buninitialized_var\b' | cut -d: -f1 | sort -u | \ xargs perl -pi -e \ 's/\buninitialized_var\(([^\)]+)\)/\1/g; s:\s*/\* (GCC be quiet|to make compiler happy) \*/$::g;' drivers/video/fbdev/riva/riva_hw.c was manually tweaked to avoid pathological white-space. No outstanding warnings were found building allmodconfig with GCC 9.3.0 for x86_64, i386, arm64, arm, powerpc, powerpc64le, s390x, mips, sparc64, alpha, and m68k. [1] https://lore.kernel.org/lkml/[email protected]/ [2] https://lore.kernel.org/lkml/CA+55aFw+Vbj0i=1TGqCR5vQkCzWJ0QxK6CernOU6eedsudAixw@mail.gmail.com/ [3] https://lore.kernel.org/lkml/CA+55aFwgbgqhbp1fkxvRKEpzyR5J8n1vKT1VZdz9knmPuXhOeg@mail.gmail.com/ [4] https://lore.kernel.org/lkml/CA+55aFz2500WfbKXAx8s67wrm9=yVJu65TpLgN_ybYNv0VEOKA@mail.gmail.com/ Reviewed-by: Leon Romanovsky <[email protected]> # drivers/infiniband and mlx4/mlx5 Acked-by: Jason Gunthorpe <[email protected]> # IB Acked-by: Kalle Valo <[email protected]> # wireless drivers Reviewed-by: Chao Yu <[email protected]> # erofs Signed-off-by: Kees Cook <[email protected]>
2020-07-14irqdomain/treewide: Keep firmware node unconditionally allocatedThomas Gleixner3-12/+17
Quite some non OF/ACPI users of irqdomains allocate firmware nodes of type IRQCHIP_FWNODE_NAMED or IRQCHIP_FWNODE_NAMED_ID and free them right after creating the irqdomain. The only purpose of these FW nodes is to convey name information. When this was introduced the core code did not store the pointer to the node in the irqdomain. A recent change stored the firmware node pointer in irqdomain for other reasons and missed to notice that the usage sites which do the alloc_fwnode/create_domain/free_fwnode sequence are broken by this. Storing a dangling pointer is dangerous itself, but in case that the domain is destroyed later on this leads to a double free. Remove the freeing of the firmware node after creating the irqdomain from all affected call sites to cure this. Fixes: 711419e504eb ("irqdomain: Add the missing assignment of domain->fwnode for named fwnode") Reported-by: Andy Shevchenko <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Acked-by: Bjorn Helgaas <[email protected]> Acked-by: Marc Zyngier <[email protected]> Cc: [email protected] Link: https://lkml.kernel.org/r/[email protected]
2020-07-10x86/entry: Fix NMI vs IRQ state trackingPeter Zijlstra2-16/+10
While the nmi_enter() users did trace_hardirqs_{off_prepare,on_finish}() there was no matching lockdep_hardirqs_*() calls to complete the picture. Introduce idtentry_{enter,exit}_nmi() to enable proper IRQ state tracking across the NMIs. Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Reviewed-by: Ingo Molnar <[email protected]> Link: https://lkml.kernel.org/r/[email protected]
2020-07-10Merge branch 'x86/urgent' into x86/entry to pick up upstream fixes.Thomas Gleixner1-0/+2
2020-07-09x86/traps: Disable interrupts in exc_aligment_check()Thomas Gleixner1-0/+2
exc_alignment_check() fails to disable interrupts before returning to the entry code. Fixes: ca4c6a9858c2 ("x86/traps: Make interrupt enable/disable symmetric in C code") Reported-by: [email protected] Signed-off-by: Thomas Gleixner <[email protected]> Acked-by: Andy Lutomirski <[email protected]> Link: https://lkml.kernel.org/r/[email protected]
2020-07-08x86/kvm: Add "nopvspin" parameter to disable PV spinlocksZhenzhong Duan1-7/+32
There are cases where a guest tries to switch spinlocks to bare metal behavior (e.g. by setting "xen_nopvspin" on XEN platform and "hv_nopvspin" on HYPER_V). That feature is missed on KVM, add a new parameter "nopvspin" to disable PV spinlocks for KVM guest. The new 'nopvspin' parameter will also replace Xen and Hyper-V specific parameters in future patches. Define variable nopvsin as global because it will be used in future patches as above. Signed-off-by: Zhenzhong Duan <[email protected]> Reviewed-by: Vitaly Kuznetsov <[email protected]> Cc: Jonathan Corbet <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Borislav Petkov <[email protected]> Cc: "H. Peter Anvin" <[email protected]> Cc: Paolo Bonzini <[email protected]> Cc: Radim Krcmar <[email protected]> Cc: Sean Christopherson <[email protected]> Cc: Vitaly Kuznetsov <[email protected]> Cc: Wanpeng Li <[email protected]> Cc: Jim Mattson <[email protected]> Cc: Joerg Roedel <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Will Deacon <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2020-07-08x86/kvm: Change print code to use pr_*() formatZhenzhong Duan1-9/+13
pr_*() is preferred than printk(KERN_* ...), after change all the print in arch/x86/kernel/kvm.c will have "kvm-guest: xxx" style. No functional change. Signed-off-by: Zhenzhong Duan <[email protected]> Reviewed-by: Vitaly Kuznetsov <[email protected]> Cc: Paolo Bonzini <[email protected]> Cc: Radim Krcmar <[email protected]> Cc: Sean Christopherson <[email protected]> Cc: Vitaly Kuznetsov <[email protected]> Cc: Wanpeng Li <[email protected]> Cc: Jim Mattson <[email protected]> Cc: Joerg Roedel <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Borislav Petkov <[email protected]> Cc: "H. Peter Anvin" <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2020-07-08Revert "KVM: X86: Fix setup the virt_spin_lock_key before static key get ↵Zhenzhong Duan1-9/+3
initialized" This reverts commit 34226b6b70980a8f81fff3c09a2c889f77edeeff. Commit 8990cac6e5ea ("x86/jump_label: Initialize static branching early") adds jump_label_init() call in setup_arch() to make static keys initialized early, so we could use the original simpler code again. The similar change for XEN is in commit 090d54bcbc54 ("Revert "x86/paravirt: Set up the virt_spin_lock_key after static keys get initialized"") Signed-off-by: Zhenzhong Duan <[email protected]> Reviewed-by: Vitaly Kuznetsov <[email protected]> Cc: Paolo Bonzini <[email protected]> Cc: Radim Krcmar <[email protected]> Cc: Sean Christopherson <[email protected]> Cc: Vitaly Kuznetsov <[email protected]> Cc: Wanpeng Li <[email protected]> Cc: Jim Mattson <[email protected]> Cc: Joerg Roedel <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Borislav Petkov <[email protected]> Cc: "H. Peter Anvin" <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2020-07-08Merge branch 'kvm-async-pf-int' into HEADPaolo Bonzini1-13/+34
2020-07-08Merge branch 'sched/urgent'Peter Zijlstra20-107/+150
2020-07-08perf/x86/intel/lbr: Support XSAVES/XRSTORS for LBR context switchKan Liang1-1/+1
In the LBR call stack mode, LBR information is used to reconstruct a call stack. To get the complete call stack, perf has to save/restore all LBR registers during a context switch. Due to a large number of the LBR registers, this process causes a high CPU overhead. To reduce the CPU overhead during a context switch, use the XSAVES/XRSTORS instructions. Every XSAVE area must follow a canonical format: the legacy region, an XSAVE header and the extended region. Although the LBR information is only kept in the extended region, a space for the legacy region and XSAVE header is still required. Add a new dedicated structure for LBR XSAVES support. Before enabling XSAVES support, the size of the LBR state has to be sanity checked, because: - the size of the software structure is calculated from the max number of the LBR depth, which is enumerated by the CPUID leaf for Arch LBR. The size of the LBR state is enumerated by the CPUID leaf for XSAVE support of Arch LBR. If the values from the two CPUID leaves are not consistent, it may trigger a buffer overflow. For example, a hypervisor may unconsciously set inconsistent values for the two emulated CPUID. - unlike other state components, the size of an LBR state depends on the max number of LBRs, which may vary from generation to generation. Expose the function xfeature_size() for the sanity check. The LBR XSAVES support will be disabled if the size of the LBR state enumerated by CPUID doesn't match with the size of the software structure. The XSAVE instruction requires 64-byte alignment for state buffers. A new macro is added to reflect the alignment requirement. A 64-byte aligned kmem_cache is created for architecture LBR. Currently, the structure for each state component is maintained in fpu/types.h. The structure for the new LBR state component should be maintained in the same place. Move structure lbr_entry to fpu/types.h as well for broader sharing. Add dedicated lbr_save/lbr_restore functions for LBR XSAVES support, which invokes the corresponding xstate helpers to XSAVES/XRSTORS LBR information at the context switch when the call stack mode is enabled. Since the XSAVES/XRSTORS instructions will be eventually invoked, the dedicated functions is named with '_xsaves'/'_xrstors' postfix. Signed-off-by: Kan Liang <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Reviewed-by: Dave Hansen <[email protected]> Link: https://lkml.kernel.org/r/[email protected]
2020-07-08x86/fpu/xstate: Add helpers for LBR dynamic supervisor featureKan Liang1-0/+72
The perf subsystem will only need to save/restore the LBR state. However, the existing helpers save all supported supervisor states to a kernel buffer, which will be unnecessary. Two helpers are introduced to only save/restore requested dynamic supervisor states. The supervisor features in XFEATURE_MASK_SUPERVISOR_SUPPORTED and XFEATURE_MASK_SUPERVISOR_UNSUPPORTED mask cannot be saved/restored using these helpers. The helpers will be used in the following patch. Signed-off-by: Kan Liang <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Reviewed-by: Dave Hansen <[email protected]> Link: https://lkml.kernel.org/r/[email protected]
2020-07-08x86/fpu/xstate: Support dynamic supervisor feature for LBRKan Liang1-5/+10
Last Branch Records (LBR) registers are used to log taken branches and other control flows. In perf with call stack mode, LBR information is used to reconstruct a call stack. To get the complete call stack, perf has to save/restore all LBR registers during a context switch. Due to the large number of the LBR registers, e.g., the current platform has 96 LBR registers, this process causes a high CPU overhead. To reduce the CPU overhead during a context switch, an LBR state component that contains all the LBR related registers is introduced in hardware. All LBR registers can be saved/restored together using one XSAVES/XRSTORS instruction. However, the kernel should not save/restore the LBR state component at each context switch, like other state components, because of the following unique features of LBR: - The LBR state component only contains valuable information when LBR is enabled in the perf subsystem, but for most of the time, LBR is disabled. - The size of the LBR state component is huge. For the current platform, it's 808 bytes. If the kernel saves/restores the LBR state at each context switch, for most of the time, it is just a waste of space and cycles. To efficiently support the LBR state component, it is desired to have: - only context-switch the LBR when the LBR feature is enabled in perf. - only allocate an LBR-specific XSAVE buffer on demand. (Besides the LBR state, a legacy region and an XSAVE header have to be included in the buffer as well. There is a total of (808+576) byte overhead for the LBR-specific XSAVE buffer. The overhead only happens when the perf is actively using LBRs. There is still a space-saving, on average, when it replaces the constant 808 bytes of overhead for every task, all the time on the systems that support architectural LBR.) - be able to use XSAVES/XRSTORS for accessing LBR at run time. However, the IA32_XSS should not be adjusted at run time. (The XCR0 | IA32_XSS are used to determine the requested-feature bitmap (RFBM) of XSAVES.) A solution, called dynamic supervisor feature, is introduced to address this issue, which - does not allocate a buffer in each task->fpu; - does not save/restore a state component at each context switch; - sets the bit corresponding to the dynamic supervisor feature in IA32_XSS at boot time, and avoids setting it at run time. - dynamically allocates a specific buffer for a state component on demand, e.g. only allocates LBR-specific XSAVE buffer when LBR is enabled in perf. (Note: The buffer has to include the LBR state component, a legacy region and a XSAVE header space.) (Implemented in a later patch) - saves/restores a state component on demand, e.g. manually invokes the XSAVES/XRSTORS instruction to save/restore the LBR state to/from the buffer when perf is active and a call stack is required. (Implemented in a later patch) A new mask XFEATURE_MASK_DYNAMIC and a helper xfeatures_mask_dynamic() are introduced to indicate the dynamic supervisor feature. For the systems which support the Architecture LBR, LBR is the only dynamic supervisor feature for now. For the previous systems, there is no dynamic supervisor feature available. Signed-off-by: Kan Liang <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Reviewed-by: Dave Hansen <[email protected]> Link: https://lkml.kernel.org/r/[email protected]
2020-07-08x86/fpu: Use proper mask to replace full instruction maskKan Liang1-0/+39
When saving xstate to a kernel/user XSAVE area with the XSAVE family of instructions, the current code applies the 'full' instruction mask (-1), which tries to XSAVE all possible features. This method relies on hardware to trim 'all possible' down to what is enabled in the hardware. The code works well for now. However, there will be a problem, if some features are enabled in hardware, but are not suitable to be saved into all kernel XSAVE buffers, like task->fpu, due to performance consideration. One such example is the Last Branch Records (LBR) state. The LBR state only contains valuable information when LBR is explicitly enabled by the perf subsystem, and the size of an LBR state is large (808 bytes for now). To avoid both CPU overhead and space overhead at each context switch, the LBR state should not be saved into task->fpu like other state components. It should be saved/restored on demand when LBR is enabled in the perf subsystem. Current copy_xregs_to_* will trigger a buffer overflow for such cases. Three sites use the '-1' instruction mask which must be updated. Two are saving/restoring the xstate to/from a kernel-allocated XSAVE buffer and can use 'xfeatures_mask_all', which will save/restore all of the features present in a normal task FPU buffer. The last one saves the register state directly to a user buffer. It could also use 'xfeatures_mask_all'. Just as it was with the '-1' argument, any supervisor states in the mask will be filtered out by the hardware and not saved to the buffer. But, to be more explicit about what is expected to be saved, use xfeatures_mask_user() for the instruction mask. KVM includes the header file fpu/internal.h. To avoid 'undefined xfeatures_mask_all' compiling issue, move copy_fpregs_to_fpstate() to fpu/core.c and export it, because: - The xfeatures_mask_all is indirectly used via copy_fpregs_to_fpstate() by KVM. The function which is directly used by other modules should be exported. - The copy_fpregs_to_fpstate() is a function, while xfeatures_mask_all is a variable for the "internal" FPU state. It's safer to export a function than a variable, which may be implicitly changed by others. - The copy_fpregs_to_fpstate() is a big function with many checks. The removal of the inline keyword should not impact the performance. Signed-off-by: Kan Liang <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Reviewed-by: Dave Hansen <[email protected]> Link: https://lkml.kernel.org/r/[email protected]