path: root/arch/x86
Age  Commit message  Author  Files  Lines
2020-09-04  x86/debug: Change thread.debugreg6 to thread.virtual_dr6  (Peter Zijlstra)  5 files, -24/+26
Current usage of thread.debugreg6 is convoluted at best. It starts life as a copy of the hardware DR6 value, but then various bits are cleared and set. Replace this with a new variable thread.virtual_dr6 that is initialized to 0 when DR6 is read and only gains bits, while the actual (on-stack) dr6 value read from the hardware only gets bits cleared. Suggested-by: Andy Lutomirski <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Tested-by: Daniel Thompson <[email protected]> Link: https://lore.kernel.org/r/[email protected]
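For illustration, a minimal sketch of the resulting split (field and macro names follow the commit message, not the exact upstream diff): the on-stack dr6 copy only loses bits as events are consumed, while thread.virtual_dr6 only gains the bits that should become visible to ptrace:

  /* Illustrative sketch only, not the exact upstream code. */
  static void sketch_handle_dr_step(struct pt_regs *regs, unsigned long dr6)
  {
          /* virtual_dr6 starts from scratch on every #DB ... */
          current->thread.virtual_dr6 = 0;

          if (dr6 & DR_STEP) {
                  /* ... the hardware copy only gets bits cleared ... */
                  dr6 &= ~DR_STEP;
                  /* ... and the virtual copy only gains bits. */
                  current->thread.virtual_dr6 |= DR_STEP;
          }
  }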
2020-09-04  x86/debug: Support negative polarity DR6 bits  (Peter Zijlstra)  3 files, -6/+5
DR6 has a whole bunch of bits that have negative polarity; they were architecturally reserved and defined to be 1, and are now getting used. Since they're 1 by default, 0 becomes the signal value. Handle this by XOR'ing the read DR6 value with the reserved mask; this flips them around such that 1 is the signal value (positive polarity). Current Linux doesn't yet support any of these bits, but there are two defined:

  - DR6[11] Bus Lock Debug Exception (ISEr39)
  - DR6[16] Restricted Transactional Memory (SDM)

Update ptrace_{set,get}_debugreg() to provide/consume the value in architectural polarity, although afaict ptrace_set_debugreg(6) is pointless as the value is not consumed anywhere. Change hw_breakpoint_restore() to always write the DR6_RESERVED value to DR6; again, there is no consumer for that write. Suggested-by: Andrew Cooper <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Tested-by: Daniel Thompson <[email protected]> Link: https://lore.kernel.org/r/[email protected]
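A hedged sketch of the read side described above (the helper name is illustrative):

  /* Sketch only: read DR6, rearm it, and normalize polarity. */
  static __always_inline unsigned long sketch_read_clear_dr6(void)
  {
          unsigned long dr6;

          get_debugreg(dr6, 6);
          set_debugreg(DR6_RESERVED, 6);  /* rearm DR6 to its reset value */

          /*
           * XOR with the reserved mask flips the negative-polarity bits
           * (e.g. Bus Lock, RTM) so that a set bit uniformly means
           * "this event occurred".
           */
          return dr6 ^ DR6_RESERVED;
  }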
2020-09-04  x86/debug: Simplify hw_breakpoint_handler()  (Peter Zijlstra)  1 file, -6/+2
This is called with interrupts disabled, there's no point in using get_cpu() and per_cpu(). Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Tested-by: Daniel Thompson <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2020-09-04  x86/debug: Remove aout_dump_debugregs()  (Peter Zijlstra)  2 files, -38/+0
Unused remnants for the bit-bucket. Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Tested-by: Daniel Thompson <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2020-09-04  x86/debug: Remove the historical junk  (Peter Zijlstra)  1 file, -11/+12
Remove the historical junk and replace it with a WARN and a comment. The problem is that even though the kernel only uses TF single-step in kprobes and KGDB, both of which consume the event before this, QEMU/KVM has bugs in this area that can trigger this state so it has to be dealt with. Suggested-by: Brian Gerst <[email protected]> Suggested-by: Andy Lutomirski <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Tested-by: Daniel Thompson <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2020-09-04  x86/debug: Move cond_local_irq_enable() block into exc_debug_user()  (Peter Zijlstra)  1 file, -29/+29
The cond_local_irq_enable() block, which deals with vm86 and sending signals, is only relevant for #DB-from-user, so move it there. This then reduces handle_debug() to only the notifier call, so rename it to notify_debug(). Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Tested-by: Daniel Thompson <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2020-09-04  x86/debug: Move historical SYSENTER junk into exc_debug_kernel()  (Peter Zijlstra)  1 file, -24/+25
The historical SYSENTER junk is explicitly for from-kernel, so move it to the #DB-from-kernel handler. It is ordered after the notifier, which is important for KGDB which uses TF single-step and needs to consume the event before that point. Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Tested-by: Daniel Thompson <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2020-09-04  x86/debug: Simplify #DB signal code  (Peter Zijlstra)  1 file, -6/+9
There's no point in calculating si_code if it's not going to be used. Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Tested-by: Daniel Thompson <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2020-09-04  x86/debug: Remove handle_debug(.user) argument  (Peter Zijlstra)  1 file, -11/+10
The handle_debug(.user) argument is used to terminate the #DB handler early for the INT1-from-kernel case, since the kernel doesn't use INT1. Remove the argument and handle this explicitly in #DB-from-kernel. Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Tested-by: Daniel Thompson <[email protected]> Acked-by: Andy Lutomirski <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2020-09-04  x86/debug: Move kprobe_debug_handler() into exc_debug_kernel()  (Peter Zijlstra)  2 files, -6/+8
Kprobes are on kernel text, and thus only matter for #DB-from-kernel. Kprobes are ordered before the generic notifier, preserve that order. Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Tested-by: Daniel Thompson <[email protected]> Acked-by: Masami Hiramatsu <[email protected]> Acked-by: Andy Lutomirski <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2020-09-04  x86/debug: Sync BTF earlier  (Peter Zijlstra)  1 file, -7/+7
Move the BTF sync near the DR6 load, as this will be the only common code guaranteed to run on every #DB. Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Tested-by: Daniel Thompson <[email protected]> Acked-by: Andy Lutomirski <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2020-09-04  x86/debug: Allow a single level of #DB recursion  (Andy Lutomirski)  1 file, -34/+31
Trying to clear DR7 around a #DB from usermode malfunctions if the task schedules when delivering SIGTRAP. Rather than trying to define a special no-recursion region, just allow a single level of recursion. The same mechanism is used for NMI, and it hasn't caused any problems yet. Fixes: 9f58fdde95c9 ("x86/db: Split out dr6/7 handling") Reported-by: Kyle Huey <[email protected]> Debugged-by: Josh Poimboeuf <[email protected]> Signed-off-by: Andy Lutomirski <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Tested-by: Daniel Thompson <[email protected]> Cc: [email protected] Link: https://lkml.kernel.org/r/8b9bd05f187231df008d48cf818a6a311cbd5c98.1597882384.git.luto@kernel.org Link: https://lore.kernel.org/r/[email protected]
2020-09-04  x86/entry: Fix AC assertion  (Peter Zijlstra)  1 file, -2/+10
The WARN added in commit 3c73b81a9164 ("x86/entry, selftests: Further improve user entry sanity checks") unconditionally triggers on an IVB machine because it does not support SMAP. For !SMAP hardware the CLAC/STAC instructions are patched out and thus, if userspace sets AC, it is still set after entry. Fixes: 3c73b81a9164 ("x86/entry, selftests: Further improve user entry sanity checks") Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Tested-by: Daniel Thompson <[email protected]> Acked-by: Andy Lutomirski <[email protected]> Cc: [email protected] Link: https://lore.kernel.org/r/[email protected]
2020-09-04  tracing/kprobes, x86/ptrace: Fix regs argument order for i386  (Vamshi K Sthambamkadi)  1 file, -1/+1
On i386, the order of parameters passed in registers is eax, edx, and ecx (as per the regparm(3) calling convention). Change the mapping in regs_get_kernel_argument() so that arg1=ax, arg2=dx, and arg3=cx. Running the selftest testcase kprobes_args_use.tc now shows the result as passed. Fixes: 3c88ee194c28 ("x86: ptrace: Add function argument access API") Signed-off-by: Vamshi K Sthambamkadi <[email protected]> Signed-off-by: Borislav Petkov <[email protected]> Acked-by: Masami Hiramatsu <[email protected]> Acked-by: Peter Zijlstra (Intel) <[email protected]> Cc: <[email protected]> Link: https://lkml.kernel.org/r/20200828113242.GA1424@cosmos
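A hedged sketch of the corrected i386 mapping (the real table lives in arch/x86/include/asm/ptrace.h and also handles arguments spilled to the stack):

  #include <linux/stddef.h>
  #include <asm/ptrace.h>

  /* Sketch: regparm(3) passes the first three arguments in eax, edx, ecx. */
  static const unsigned int argument_offs[] = {
          offsetof(struct pt_regs, ax),   /* arg1 */
          offsetof(struct pt_regs, dx),   /* arg2 */
          offsetof(struct pt_regs, cx),   /* arg3 */
  };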
2020-09-04  arm64: mte: Add specific SIGSEGV codes  (Vincenzo Frascino)  1 file, -1/+1
Add MTE-specific SIGSEGV codes to siginfo.h and update the x86 BUILD_BUG_ON(NSIGSEGV != 7) compile check. Signed-off-by: Vincenzo Frascino <[email protected]> [[email protected]: renamed precise/imprecise to sync/async] [[email protected]: dropped #ifdef __aarch64__, renumbered] Signed-off-by: Catalin Marinas <[email protected]> Acked-by: "Eric W. Biederman" <[email protected]> Cc: Arnd Bergmann <[email protected]> Cc: Will Deacon <[email protected]>
2020-09-04  x86/entry/64: Do not include inst.h in calling.h  (Uros Bizjak)  1 file, -1/+0
inst.h was included in calling.h solely to instantiate the RDPID macro. The usage of RDPID was removed in 6a3ea3e68b8a ("x86/entry/64: Do not use RDPID in paranoid entry to accomodate KVM") so remove the include. Fixes: 6a3ea3e68b8a ("x86/entry/64: Do not use RDPID in paranoid entry to accomodate KVM") Signed-off-by: Uros Bizjak <[email protected]> Signed-off-by: Borislav Petkov <[email protected]> Link: https://lkml.kernel.org/r/[email protected]
2020-09-04  x86, fakenuma: Fix invalid starting node ID  (Huang Ying)  1 file, -1/+1
Commit cc9aec03e58f ("x86/numa_emulation: Introduce uniform split capability") uses "-1" as the starting node ID, which causes a strange kernel log such as the following when "numa=fake=32G" is added to the kernel command line:

  Faking node -1 at [mem 0x0000000000000000-0x0000000893ffffff] (35136MB)
  Faking node 0 at [mem 0x0000001840000000-0x000000203fffffff] (32768MB)
  Faking node 1 at [mem 0x0000000894000000-0x000000183fffffff] (64192MB)
  Faking node 2 at [mem 0x0000002040000000-0x000000283fffffff] (32768MB)
  Faking node 3 at [mem 0x0000002840000000-0x000000303fffffff] (32768MB)

And finally the kernel crashes:

  BUG: Bad page state in process swapper pfn:00011
  page:(____ptrval____) refcount:0 mapcount:1 mapping:(____ptrval____) index:0x55cd7e44b270 pfn:0x11
  failed to read mapping contents, not a valid kernel address?
  flags: 0x5(locked|uptodate)
  raw: 0000000000000005 000055cd7e44af30 000055cd7e44af50 0000000100000006
  raw: 000055cd7e44b270 000055cd7e44b290 0000000000000000 000055cd7e44b510
  page dumped because: page still charged to cgroup
  page->mem_cgroup:000055cd7e44b510
  Modules linked in:
  CPU: 0 PID: 0 Comm: swapper Not tainted 5.9.0-rc2 #1
  Hardware name: Intel Corporation S2600WFT/S2600WFT, BIOS SE5C620.86B.02.01.0008.031920191559 03/19/2019
  Call Trace:
   dump_stack+0x57/0x80
   bad_page.cold+0x63/0x94
   __free_pages_ok+0x33f/0x360
   memblock_free_all+0x127/0x195
   mem_init+0x23/0x1f5
   start_kernel+0x219/0x4f5
   secondary_startup_64+0xb6/0xc0

Fix this bug by using 0 as the starting node ID. This restores the original behavior before cc9aec03e58f. [ mingo: Massaged the changelog. ] Fixes: cc9aec03e58f ("x86/numa_emulation: Introduce uniform split capability") Signed-off-by: "Huang, Ying" <[email protected]> Signed-off-by: Ingo Molnar <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2020-09-03  x86/uaccess: Use XORL %0,%0 in __get_user_asm()  (Uros Bizjak)  1 file, -1/+1
XORL %0,%0 is equivalent to XORQ %0,%0 as both will zero the entire register. Use XORL %0,%0 for all operand sizes to avoid a REX prefix byte when legacy registers are used and a size prefix byte when 16-bit registers are used. Zeroing the full register is OK in this use case. As a result, the size of the .fixup section decreases by 20 bytes. [ bp: Massage commit message. ] Signed-off-by: Uros Bizjak <[email protected]> Signed-off-by: Borislav Petkov <[email protected]> Reviewed-by: H. Peter Anvin (Intel) <[email protected]> Link: https://lkml.kernel.org/r/[email protected]
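A standalone illustration of the encoding argument (this is not the kernel's __get_user_asm() fixup itself): writing the 32-bit register implicitly zero-extends into the full 64-bit register, so XORL needs neither a REX.W nor an operand-size prefix:

  /* Sketch: %k0 forces the 32-bit name of whatever register is chosen. */
  static inline unsigned long zero_reg(void)
  {
          unsigned long val;

          /* xorl %eax,%eax also clears the upper 32 bits of %rax. */
          asm volatile("xorl %k0, %k0" : "=r" (val));
          return val;
  }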
2020-09-03  dma-mapping: introduce dma_get_seg_boundary_nr_pages()  (Nicolin Chen)  1 file, -2/+1
We found that callers of dma_get_seg_boundary mostly do an ALIGN with the page mask and then a page shift to get the number of pages:

  ALIGN(boundary + 1, 1 << shift) >> shift

However, the boundary might be as large as ULONG_MAX, which means that a device has no specific boundary limit. So either the "+ 1" or passing it to ALIGN() would potentially overflow. According to kernel defines:

  #define ALIGN_MASK(x, mask) (((x) + (mask)) & ~(mask))
  #define ALIGN(x, a)         ALIGN_MASK(x, (typeof(x))(a) - 1)

we can simplify the logic here into a helper function doing:

  ALIGN(boundary + 1, 1 << shift) >> shift
    = ALIGN_MASK(b + 1, (1 << s) - 1) >> s
    = {[b + 1 + (1 << s) - 1] & ~[(1 << s) - 1]} >> s
    = [b + 1 + (1 << s) - 1] >> s
    = [b + (1 << s)] >> s
    = (b >> s) + 1

This patch introduces and applies dma_get_seg_boundary_nr_pages() as an overflow-free helper for the dma_get_seg_boundary() callers to get numbers of pages. It also takes care of the NULL dev case for non-DMA API callers. Suggested-by: Christoph Hellwig <[email protected]> Signed-off-by: Nicolin Chen <[email protected]> Acked-by: Niklas Schnelle <[email protected]> Acked-by: Michael Ellerman <[email protected]> (powerpc) Signed-off-by: Christoph Hellwig <[email protected]>
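A sketch of the helper as described above (the upstream version lives in include/linux/dma-mapping.h; the exact default used for a NULL dev may differ):

  #include <linux/dma-mapping.h>

  /* Sketch: overflow-free "number of pages within the segment boundary". */
  static inline unsigned long dma_get_seg_boundary_nr_pages(struct device *dev,
                                                            unsigned int page_shift)
  {
          if (!dev)
                  return (U32_MAX >> page_shift) + 1;
          return (dma_get_seg_boundary(dev) >> page_shift) + 1;
  }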
2020-09-03  x86/mm/32: Bring back vmalloc faulting on x86_32  (Joerg Roedel)  1 file, -0/+78
One can not simply remove vmalloc faulting on x86-32. Upstream commit 7f0a002b5a21 ("x86/mm: remove vmalloc faulting") removed it on x86 altogether because the arch_sync_kernel_mappings() interface had been introduced previously. This interface added synchronization of vmalloc/ioremap page-table updates to all page-tables in the system at creation time and was thought to make vmalloc faulting obsolete.

But that assumption was incredibly naive. It turned out that there is a race window between the time the vmalloc or ioremap code establishes a mapping and the time it synchronizes this change to other page-tables in the system. During this race window another CPU or thread can establish a vmalloc mapping which uses the same intermediate page-table entries (e.g. PMD or PUD) and does no synchronization in the end, because it found all necessary mappings already present in the kernel reference page-table. But when these intermediate page-table entries are not yet synchronized, the other CPU or thread will continue with a vmalloc address that is not yet mapped in the page-table it currently uses, causing an unhandled page fault and oops like below:

  BUG: unable to handle page fault for address: fe80c000
  #PF: supervisor write access in kernel mode
  #PF: error_code(0x0002) - not-present page
  *pde = 33183067 *pte = a8648163
  Oops: 0002 [#1] SMP
  CPU: 1 PID: 13514 Comm: cve-2017-17053 Tainted: G ...
  Call Trace:
   ldt_dup_context+0x66/0x80
   dup_mm+0x2b3/0x480
   copy_process+0x133b/0x15c0
   _do_fork+0x94/0x3e0
   __ia32_sys_clone+0x67/0x80
   __do_fast_syscall_32+0x3f/0x70
   do_fast_syscall_32+0x29/0x60
   do_SYSENTER_32+0x15/0x20
   entry_SYSENTER_32+0x9f/0xf2
  EIP: 0xb7eef549

So the arch_sync_kernel_mappings() interface is racy, but removing it would mean re-introducing the vmalloc_sync_all() interface, which is even more awful. Keep arch_sync_kernel_mappings() in place and catch the race condition in the page-fault handler instead. Do a partial revert of the above commit to get vmalloc faulting on x86-32 back in place. Fixes: 7f0a002b5a21 ("x86/mm: remove vmalloc faulting") Reported-by: Naresh Kamboju <[email protected]> Signed-off-by: Joerg Roedel <[email protected]> Signed-off-by: Ingo Molnar <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2020-09-03  x86/cmdline: Disable jump tables for cmdline.c  (Arvind Sankar)  1 file, -1/+1
When CONFIG_RETPOLINE is disabled, Clang uses a jump table for the switch statement in cmdline_find_option (jump tables are disabled when CONFIG_RETPOLINE is enabled). This function is called very early in boot from sme_enable() if CONFIG_AMD_MEM_ENCRYPT is enabled. At this time, the kernel is still executing out of the identity mapping, but the jump table will contain virtual addresses. Fix this by disabling jump tables for cmdline.c when AMD_MEM_ENCRYPT is enabled. Signed-off-by: Arvind Sankar <[email protected]> Signed-off-by: Ingo Molnar <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2020-09-03  x86/boot/compressed: Warn on orphan section placement  (Kees Cook)  1 file, -0/+1
We don't want to depend on the linker's orphan section placement heuristics as these can vary between linkers, and may change between versions. All sections need to be explicitly handled in the linker script. Now that all sections are explicitly handled, enable orphan section warnings. Signed-off-by: Kees Cook <[email protected]> Signed-off-by: Ingo Molnar <[email protected]> Reviewed-by: Nick Desaulniers <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2020-09-03  x86/build: Warn on orphan section placement  (Kees Cook)  1 file, -0/+4
We don't want to depend on the linker's orphan section placement heuristics as these can vary between linkers, and may change between versions. All sections need to be explicitly handled in the linker script. Now that all sections are explicitly handled, enable orphan section warnings. Signed-off-by: Kees Cook <[email protected]> Signed-off-by: Ingo Molnar <[email protected]> Reviewed-by: Nick Desaulniers <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2020-09-01  Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next  (David S. Miller)  1 file, -11/+21
Daniel Borkmann says:

====================
pull-request: bpf-next 2020-09-01

The following pull-request contains BPF updates for your *net-next* tree.

There are two small conflicts when pulling, resolve as follows:

1) Merge conflict in tools/lib/bpf/libbpf.c between 88a82120282b ("libbpf: Factor out common ELF operations and improve logging") in bpf-next and 1e891e513e16 ("libbpf: Fix map index used in error message") in net-next. Resolve by taking the hunk in bpf-next:

  [...]
  scn = elf_sec_by_idx(obj, obj->efile.btf_maps_shndx);
  data = elf_sec_data(obj, scn);
  if (!scn || !data) {
          pr_warn("elf: failed to get %s map definitions for %s\n",
                  MAPS_ELF_SEC, obj->path);
          return -EINVAL;
  }
  [...]

2) Merge conflict in drivers/net/ethernet/mellanox/mlx5/core/en/xsk/rx.c between 9647c57b11e5 ("xsk: i40e: ice: ixgbe: mlx5: Test for dma_need_sync earlier for better performance") in bpf-next and e20f0dbf204f ("net/mlx5e: RX, Add a prefetch command for small L1_CACHE_BYTES") in net-next. Resolve the two locations by retaining net_prefetch() and taking xsk_buff_dma_sync_for_cpu() from bpf-next. Should look like:

  [...]
  xdp_set_data_meta_invalid(xdp);
  xsk_buff_dma_sync_for_cpu(xdp, rq->xsk_pool);
  net_prefetch(xdp->data);
  [...]

We've added 133 non-merge commits during the last 14 day(s) which contain a total of 246 files changed, 13832 insertions(+), 3105 deletions(-).

The main changes are:

1) Initial support for sleepable BPF programs along with bpf_copy_from_user() helper for tracing to reliably access user memory, from Alexei Starovoitov.
2) Add BPF infra for writing and parsing TCP header options, from Martin KaFai Lau.
3) bpf_d_path() helper for returning full path for given 'struct path', from Jiri Olsa.
4) AF_XDP support for shared umems between devices and queues, from Magnus Karlsson.
5) Initial prep work for full BPF-to-BPF call support in libbpf, from Andrii Nakryiko.
6) Generalize bpf_sk_storage map & add local storage for inodes, from KP Singh.
7) Implement sockmap/hash updates from BPF context, from Lorenz Bauer.
8) BPF xor verification for scalar types & add BPF link iterator, from Yonghong Song.
9) Use target's prog type for BPF_PROG_TYPE_EXT prog verification, from Udip Pant.
10) Rework BPF tracing samples to use libbpf loader, from Daniel T. Lee.
11) Fix xdpsock sample to really cycle through all buffers, from Weqaar Janjua.
12) Improve type safety for tun/veth XDP frame handling, from Maciej Żenczykowski.
13) Various smaller cleanups and improvements all over the place.
====================

Signed-off-by: David S. Miller <[email protected]>
2020-09-01  x86/PCI: Fix intel_mid_pci.c build error when ACPI is not enabled  (Randy Dunlap)  1 file, -0/+1
Fix build error when CONFIG_ACPI is not set/enabled by adding the header file <asm/acpi.h> which contains a stub for the function in the build error. ../arch/x86/pci/intel_mid_pci.c: In function ‘intel_mid_pci_init’: ../arch/x86/pci/intel_mid_pci.c:303:2: error: implicit declaration of function ‘acpi_noirq_set’; did you mean ‘acpi_irq_get’? [-Werror=implicit-function-declaration] acpi_noirq_set(); Fixes: a912a7584ec3 ("x86/platform/intel-mid: Move PCI initialization to arch_init()") Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Randy Dunlap <[email protected]> Signed-off-by: Bjorn Helgaas <[email protected]> Reviewed-by: Andy Shevchenko <[email protected]> Reviewed-by: Jesse Barnes <[email protected]> Acked-by: Thomas Gleixner <[email protected]> Cc: [email protected] # v4.16+ Cc: Jacob Pan <[email protected]> Cc: Len Brown <[email protected]> Cc: Jesse Barnes <[email protected]> Cc: Arjan van de Ven <[email protected]>
2020-09-01  x86/boot/compressed: Add missing debugging sections to output  (Kees Cook)  1 file, -0/+2
Include the missing DWARF and STABS sections in the compressed image, when they are present. Signed-off-by: Kees Cook <[email protected]> Signed-off-by: Ingo Molnar <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2020-09-01  x86/boot/compressed: Remove, discard, or assert for unwanted sections  (Kees Cook)  2 files, -2/+13
In preparation for warning on orphan sections, stop the linker from generating the .eh_frame* sections, discard unwanted non-zero-sized generated sections, and enforce other expected-to-be-zero-sized sections (since discarding them might hide problems with them suddenly gaining unexpected entries). Signed-off-by: Kees Cook <[email protected]> Signed-off-by: Ingo Molnar <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2020-09-01  x86/boot/compressed: Reorganize zero-size section asserts  (Kees Cook)  1 file, -18/+26
For readability, move the zero-sized sections to the end after DISCARDS. Signed-off-by: Kees Cook <[email protected]> Signed-off-by: Ingo Molnar <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2020-09-01  x86/build: Add asserts for unwanted sections  (Kees Cook)  1 file, -0/+24
In preparation for warning on orphan sections, enforce other expected-to-be-zero-sized sections (since discarding them might hide problems with them suddenly gaining unexpected entries). Signed-off-by: Kees Cook <[email protected]> Signed-off-by: Ingo Molnar <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2020-09-01  x86/build: Enforce an empty .got.plt section  (Kees Cook)  1 file, -1/+13
The .got.plt section should always be zero (or filled only with the linker-generated lazy dispatch entry). Enforce this with an assert and mark the section as INFO. This is more sensitive than just blindly discarding the section. Signed-off-by: Kees Cook <[email protected]> Signed-off-by: Ingo Molnar <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2020-09-01  x86/asm: Avoid generating unused kprobe sections  (Kees Cook)  1 file, -1/+5
When !CONFIG_KPROBES, do not generate kprobe sections. This makes sure there are no unexpected sections encountered by the linker scripts. Signed-off-by: Kees Cook <[email protected]> Signed-off-by: Ingo Molnar <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2020-09-01  x86/perf, static_call: Optimize x86_pmu methods  (Peter Zijlstra)  1 file, -40/+94
Replace many of the indirect calls with static_call(). The average PMI time, as measured by perf_sample_event_took()*:

  PRE:  3283.03 [ns]
  POST: 3145.12 [ns]

Which is a ~138 [ns] win per PMI, or a ~4.2% decrease.

[*] on an IVB-EP, using: 'perf record -a -e cycles -- make O=defconfig-build/ -j80'

Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Signed-off-by: Ingo Molnar <[email protected]> Cc: Linus Torvalds <[email protected]> Link: https://lore.kernel.org/r/[email protected]
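A hedged sketch of the pattern (not the actual x86_pmu conversion): an indirect call through a function pointer becomes a static_call() that is patched into a direct CALL at the call site:

  #include <linux/static_call.h>

  /* Sketch only; the default target is a no-op until the real PMU is known. */
  static void default_enable_all(int added) { }

  DEFINE_STATIC_CALL(x86_pmu_enable_all, default_enable_all);

  static void pmu_enable(void)
  {
          /* Previously: x86_pmu.enable_all(0);  -- an indirect call */
          static_call(x86_pmu_enable_all)(0);
  }

  /* Once the active PMU is known, e.g. during init: */
  /* static_call_update(x86_pmu_enable_all, x86_pmu.enable_all); */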
2020-09-01  static_call: Allow early init  (Peter Zijlstra)  2 files, -1/+6
In order to use static_call() to wire up x86_pmu, we need to initialize earlier, specifically before memory allocation works; copy some of the tricks from jump_label to enable this. Primarily we overload key->next to store a sites pointer when there are no modules, this avoids having to use kmalloc() to initialize the sites and allows us to run much earlier. Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Signed-off-by: Ingo Molnar <[email protected]> Reviewed-by: Steven Rostedt (VMware) <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2020-09-01  static_call: Add some validation  (Peter Zijlstra)  1 file, -2/+26
Verify the text we're about to change is as we expect it to be. Requested-by: Steven Rostedt <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Signed-off-by: Ingo Molnar <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2020-09-01  static_call: Handle tail-calls  (Peter Zijlstra)  1 file, -3/+18
GCC can turn our static_call(name)(args...) into a tail call, in which case we get a JMP.d32 into the trampoline (which then does a further tail-call). Teach objtool to recognise and mark these in .static_call_sites and adjust the code patching to deal with this. Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Signed-off-by: Ingo Molnar <[email protected]> Cc: Linus Torvalds <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2020-09-01  static_call: Add static_call_cond()  (Peter Zijlstra)  2 files, -13/+41
Extend the static_call infrastructure to optimize the following common pattern: if (func_ptr) func_ptr(args...) For the trampoline (which is in effect a tail-call), we patch the JMP.d32 into a RET, which then directly consumes the trampoline call. For the in-line sites we replace the CALL with a NOP5. NOTE: this is 'obviously' limited to functions with a 'void' return type. NOTE: DEFINE_STATIC_COND_CALL() only requires a typename, as opposed to a full function. Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Signed-off-by: Ingo Molnar <[email protected]> Cc: Linus Torvalds <[email protected]> Link: https://lore.kernel.org/r/[email protected]
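A hedged usage sketch: the _NULL variant is defined from a function type rather than a default implementation, and static_call_cond() patches to a NOP while no target is installed (void return only):

  #include <linux/perf_event.h>
  #include <linux/static_call.h>

  /* Sketch only; structure and key names are illustrative. */
  struct pmu_ops {
          void (*sched_task)(struct perf_event_context *ctx, bool sched_in);
  };

  static struct pmu_ops pmu_ops;

  DEFINE_STATIC_CALL_NULL(pmu_sched_task, *pmu_ops.sched_task);

  static void ctx_sched_task(struct perf_event_context *ctx)
  {
          /* Behaves like: if (pmu_ops.sched_task) pmu_ops.sched_task(ctx, true); */
          static_call_cond(pmu_sched_task)(ctx, true);
  }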
2020-09-01  x86/alternatives: Teach text_poke_bp() to emulate RET  (Peter Zijlstra)  2 files, -0/+24
Future patches will need to poke a RET instruction, provide the infrastructure required for this. Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Signed-off-by: Ingo Molnar <[email protected]> Reviewed-by: Steven Rostedt (VMware) <[email protected]> Cc: Masami Hiramatsu <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2020-09-01  x86/static_call: Add inline static call implementation for x86-64  (Josh Poimboeuf)  4 files, -2/+18
Add the inline static call implementation for x86-64. The generated code is identical to the out-of-line case, except we move the trampoline into its own section. Objtool uses the trampoline naming convention to detect all the call sites. It then annotates those call sites in the .static_call_sites section. During boot (and module init), the call sites are patched to call directly into the destination function. The temporary trampoline is then no longer used. [peterz: merged trampolines, put trampoline in section] Signed-off-by: Josh Poimboeuf <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Signed-off-by: Ingo Molnar <[email protected]> Cc: Linus Torvalds <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2020-09-01  x86/static_call: Add out-of-line static call implementation  (Josh Poimboeuf)  4 files, -0/+56
Add the x86 out-of-line static call implementation. For each key, a permanent trampoline is created which is the destination for all static calls for the given key. The trampoline has a direct jump which gets patched by static_call_update() when the destination function changes. [peterz: fixed trampoline, rewrote patching code] Signed-off-by: Josh Poimboeuf <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Signed-off-by: Ingo Molnar <[email protected]> Cc: Linus Torvalds <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2020-09-01  static_call: Avoid kprobes on inline static_call()s  (Peter Zijlstra)  1 file, -1/+3
Similar to how we disallow kprobes on any other dynamic text (ftrace/jump_label) also disallow kprobes on inline static_call()s. Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Signed-off-by: Ingo Molnar <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2020-09-01  vmlinux.lds.h: Split ELF_DETAILS from STABS_DEBUG  (Kees Cook)  2 files, -0/+3
The .comment section doesn't belong in STABS_DEBUG. Split it out into a new macro named ELF_DETAILS. This will gain other non-debug sections that need to be accounted for when linking with --orphan-handling=warn. Signed-off-by: Kees Cook <[email protected]> Signed-off-by: Ingo Molnar <[email protected]> Cc: [email protected] Link: https://lore.kernel.org/r/[email protected]
2020-08-30  x86/kvm: Expose TSX Suspend Load Tracking feature  (Cathy Zhang)  1 file, -1/+1
The TSX suspend load tracking instruction is supported by the Intel uarch Sapphire Rapids. It aims to give a way to choose which memory accesses do not need to be tracked in the TSX read set. Its availability is indicated as CPUID.(EAX=7,ECX=0):EDX[bit 16]. Expose the TSX Suspend Load Address Tracking feature in KVM CPUID, so KVM can pass this information to guests and they can make use of this feature accordingly. Signed-off-by: Cathy Zhang <[email protected]> Signed-off-by: Borislav Petkov <[email protected]> Reviewed-by: Tony Luck <[email protected]> Link: https://lkml.kernel.org/r/[email protected]
2020-08-30  Merge tag 'x86-urgent-2020-08-30' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip  (Linus Torvalds)  2 files, -13/+29
Pull x86 fixes from Thomas Gleixner: "Three interrupt related fixes for X86:

  - Move disabling of the local APIC after invoking fixup_irqs() to ensure that interrupts which are incoming are noted in the IRR and not ignored.

  - Unbreak affinity setting. The rework of the entry code reused the regular exception entry code for device interrupts. The vector number is pushed into the errorcode slot on the stack which is then lifted into an argument and set to -1 because that's regs->orig_ax which is used in quite some places to check whether the entry came from a syscall. But it was overlooked that orig_ax is used in the affinity cleanup code to validate whether the interrupt has arrived on the new target. It turned out that this vector check is pointless because interrupts are never moved from one vector to another on the same CPU. That check is a historical leftover from the time when x86 supported multi-CPU affinities, but no longer needed with the now strict single CPU affinity. Famous last words ...

  - Add a missing check for an empty cpumask into the matrix allocator. The affinity change added a warning to catch the case where an interrupt is moved on the same CPU to a different vector. This triggers because a condition with an empty cpumask returns an assignment from the allocator as the allocator uses for_each_cpu() without checking the cpumask for being empty. The historical inconsistent for_each_cpu() behaviour of ignoring the cpumask and unconditionally claiming that CPU0 is in the mask struck again. Sigh.

  plus a new entry into the MAINTAINERS file for the HPE/UV platform"

* tag 'x86-urgent-2020-08-30' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  genirq/matrix: Deal with the sillyness of for_each_cpu() on UP
  x86/irq: Unbreak interrupt affinity setting
  x86/hotplug: Silence APIC only after all interrupts are migrated
  MAINTAINERS: Add entry for HPE Superdome Flex (UV) maintainers
2020-08-30  Merge tag 'locking-urgent-2020-08-30' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip  (Linus Torvalds)  4 files, -20/+3
Pull locking fixes from Thomas Gleixner: "A set of fixes for lockdep, tracing and RCU:

  - Prevent recursion by using raw_cpu_* operations

  - Fixup the interrupt state in the cpu idle code to be consistent

  - Push rcu_idle_enter/exit() invocations deeper into the idle path so that the lock operations are inside the RCU watching sections

  - Move trace_cpu_idle() into generic code so it's called before RCU goes idle.

  - Handle raw_local_irq* vs. local_irq* operations correctly

  - Move the tracepoints out from under the lockdep recursion handling which turned out to be fragile and inconsistent"

* tag 'locking-urgent-2020-08-30' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  lockdep,trace: Expose tracepoints
  lockdep: Only trace IRQ edges
  mips: Implement arch_irqs_disabled()
  arm64: Implement arch_irqs_disabled()
  nds32: Implement arch_irqs_disabled()
  locking/lockdep: Cleanup
  x86/entry: Remove unused THUNKs
  cpuidle: Move trace_cpu_idle() into generic code
  cpuidle: Make CPUIDLE_FLAG_TLB_FLUSHED generic
  sched,idle,rcu: Push rcu_idle deeper into the idle path
  cpuidle: Fixup IRQ state
  lockdep: Use raw_cpu_*() for per-cpu variables
2020-08-30  x86/cpufeatures: Enumerate TSX suspend load address tracking instructions  (Kyung Min Park)  1 file, -0/+1
Intel TSX suspend load tracking instructions aim to give a way to choose which memory accesses do not need to be tracked in the TSX read set. Add TSX suspend load tracking CPUID feature flag TSXLDTRK for enumeration. A processor supports Intel TSX suspend load address tracking if CPUID.0x07.0x0:EDX[16] is present. Two instructions XSUSLDTRK, XRESLDTRK are available when this feature is present. The CPU feature flag is shown as "tsxldtrk" in /proc/cpuinfo. Signed-off-by: Kyung Min Park <[email protected]> Signed-off-by: Cathy Zhang <[email protected]> Signed-off-by: Borislav Petkov <[email protected]> Reviewed-by: Tony Luck <[email protected]> Link: https://lkml.kernel.org/r/[email protected]
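A hedged sketch of what such an enumeration looks like in cpufeatures.h (bit placement per the commit message, CPUID.(7,0):EDX bit 16; the exact word index and comment text follow the upstream header):

  /* Illustrative; word 18 holds the CPUID level 0x00000007:0 (EDX) flags. */
  #define X86_FEATURE_TSXLDTRK    (18*32 + 16) /* TSX Suspend Load Address Tracking */

With the flag defined, code can test boot_cpu_has(X86_FEATURE_TSXLDTRK) and the lower-cased name appears in /proc/cpuinfo as described above.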
2020-08-28  bpf: Introduce sleepable BPF programs  (Alexei Starovoitov)  1 file, -11/+21
Introduce sleepable BPF programs that can request such property for themselves via the BPF_F_SLEEPABLE flag at program load time. In such case they will be able to use helpers like bpf_copy_from_user() that might sleep. At present only fentry/fexit/fmod_ret and lsm programs can request to be sleepable and only when they are attached to kernel functions that are known to allow sleeping.

The non-sleepable programs rely on implicit rcu_read_lock() and migrate_disable() to protect the life time of programs, maps that they use and per-cpu kernel structures used to pass info between bpf programs and the kernel. The sleepable programs cannot be enclosed into rcu_read_lock(). migrate_disable() maps to preempt_disable() in non-RT kernels, so the progs should not be enclosed in migrate_disable() either. Therefore rcu_read_lock_trace is used to protect the life time of sleepable progs.

There are many networking and tracing program types. In many cases the 'struct bpf_prog *' pointer itself is rcu protected within some other kernel data structure and the kernel code is using rcu_dereference() to load that program pointer and call BPF_PROG_RUN() on it. All these cases are not touched. Instead sleepable bpf programs are allowed with bpf trampoline only. The program pointers are hard-coded into generated assembly of bpf trampoline and synchronize_rcu_tasks_trace() is used to protect the life time of the program. The same trampoline can hold both sleepable and non-sleepable progs.

When rcu_read_lock_trace is held it means that some sleepable bpf program is running from bpf trampoline. Those programs can use bpf arrays and preallocated hash/lru maps. These map types wait on programs to complete via synchronize_rcu_tasks_trace(). Updates to the trampoline now have to do synchronize_rcu_tasks_trace() and synchronize_rcu_tasks() to wait for sleepable progs to finish and for trampoline assembly to finish.

This is the first step of introducing sleepable progs. Eventually dynamically allocated hash maps can be allowed and networking program types can become sleepable too.

Signed-off-by: Alexei Starovoitov <[email protected]> Signed-off-by: Daniel Borkmann <[email protected]> Reviewed-by: Josef Bacik <[email protected]> Acked-by: Andrii Nakryiko <[email protected]> Acked-by: KP Singh <[email protected]> Link: https://lore.kernel.org/bpf/[email protected]
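A hedged sketch of what a sleepable program can look like with libbpf (the ".s" section suffix is how libbpf requests BPF_F_SLEEPABLE; the attach point and what gets copied are made up purely for illustration):

  #include "vmlinux.h"
  #include <bpf/bpf_helpers.h>
  #include <bpf/bpf_tracing.h>

  char LICENSE[] SEC("license") = "GPL";

  /* Illustrative sleepable LSM program; not from the kernel tree. */
  SEC("lsm.s/bprm_committed_creds")
  int BPF_PROG(exec_audit, struct linux_binprm *bprm)
  {
          char buf[64] = {};

          /* May fault and sleep: only legal because this program is sleepable. */
          bpf_copy_from_user(buf, sizeof(buf), (const void *)bprm->p);
          return 0;
  }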
2020-08-27  x86/mpparse: Remove duplicate io_apic.h include  (Wang Hai)  1 file, -1/+0
Remove asm/io_apic.h which is included more than once. Reported-by: Hulk Robot <[email protected]> Signed-off-by: Wang Hai <[email protected]> Signed-off-by: Borislav Petkov <[email protected]> Reviewed-by: Pekka Enberg <[email protected]> Link: https://lkml.kernel.org/r/[email protected]
2020-08-27  x86/irq: Unbreak interrupt affinity setting  (Thomas Gleixner)  1 file, -7/+9
Several people reported that 5.8 broke the interrupt affinity setting mechanism.

The consolidation of the entry code reused the regular exception entry code for device interrupts and changed the way the vector number is conveyed from ptregs->orig_ax to a function argument. The low level entry uses the hardware error code slot to push the vector number onto the stack, which is retrieved from there into a function argument, and the slot on the stack is set to -1. The reason for setting it to -1 is that the error code slot is at the position where pt_regs::orig_ax is. A positive value in pt_regs::orig_ax indicates that the entry came via a syscall. If it's not set to a negative value then a signal delivery on return to userspace would try to restart a syscall. But there are other places which rely on pt_regs::orig_ax being a valid indicator for syscall entry.

Setting pt_regs::orig_ax to -1 has a nasty side effect vs. the interrupt affinity setting mechanism, which was overlooked when this change was made. Moving interrupts on x86 happens in several steps. A new vector on a different CPU is allocated and the relevant interrupt source is reprogrammed to that. But that's racy and there might be an interrupt already in flight to the old vector. So the old vector is preserved until the first interrupt arrives on the new vector and the new target CPU. Once that happens the old vector is cleaned up, but this cleanup still depends on the vector number being stored in pt_regs::orig_ax, which is now -1. That -1 makes the cleanup check (pt_regs::orig_ax == new_vector) always false. As a consequence the interrupt is moved once, but then it cannot be moved anymore because the cleanup of the old vector never happens.

There would be several ways to convey the vector information to that place in the guts of the interrupt handling, but on deeper inspection it turned out that this check is pointless and a leftover from the old affinity model of x86 which supported multi-CPU affinities. Under this model it was possible that an interrupt had an old and a new vector on the same CPU, so the vector match was required. Under the new model the effective affinity of an interrupt is always a single CPU from the requested affinity mask. If the affinity mask changes then either the interrupt stays on the CPU and on the same vector when that CPU is still in the new affinity mask or it is moved to a different CPU, but it is never moved to a different vector on the same CPU.

Ergo the cleanup check for the matching vector number is not required and can be removed, which makes the dependency on pt_regs::orig_ax go away. The remaining check for new_cpu == smp_processor_id() is completely sufficient. If it matches then the interrupt was successfully migrated and the cleanup can proceed. For paranoia's sake add a warning into the vector assignment code to validate that the assumption of never moving to a different vector on the same CPU holds.

Fixes: 633260fa143 ("x86/irq: Convey vector as argument and not in ptregs") Reported-by: Alex bykov <[email protected]> Reported-by: Avi Kivity <[email protected]> Reported-by: Alexander Graf <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Tested-by: Alexander Graf <[email protected]> Cc: [email protected] Link: https://lore.kernel.org/r/[email protected]
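A hedged sketch, close to what the simplified cleanup check ends up looking like (names written from memory, not the verbatim upstream diff):

  static void sketch_irq_complete_move(struct irq_cfg *cfg)
  {
          struct apic_chip_data *apicd;

          apicd = container_of(cfg, struct apic_chip_data, hw_irq_cfg);
          if (likely(!apicd->move_in_progress))
                  return;

          /*
           * The old check additionally required vector == regs->orig_ax,
           * which is always -1 now. Arriving on the new target CPU is
           * sufficient because an interrupt never moves from one vector
           * to another on the same CPU.
           */
          if (apicd->cpu == smp_processor_id())
                  __send_cleanup_vector(apicd);
  }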
2020-08-27  x86/hotplug: Silence APIC only after all interrupts are migrated  (Ashok Raj)  1 file, -6/+20
There is a race when taking a CPU offline. Current code looks like this:

  native_cpu_disable()
  {
      ...
      apic_soft_disable();
      /*
       * Any existing set bits for pending interrupt to
       * this CPU are preserved and will be sent via IPI
       * to another CPU by fixup_irqs().
       */
      cpu_disable_common();
      {
          ....
          /*
           * Race window happens here. Once local APIC has been
           * disabled any new interrupts from the device to
           * the old CPU are lost
           */
          fixup_irqs(); // Too late to capture anything in IRR.
          ...
      }
  }

The fix is to disable the APIC *after* cpu_disable_common().

Testing was done with a USB NIC that provided a source of frequent interrupts. A script migrated interrupts to a specific CPU and then took that CPU offline.

Fixes: 60dcaad5736f ("x86/hotplug: Silence APIC and NMI when CPU is dead") Reported-by: Evan Green <[email protected]> Signed-off-by: Ashok Raj <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Tested-by: Mathias Nyman <[email protected]> Tested-by: Evan Green <[email protected]> Reviewed-by: Evan Green <[email protected]> Cc: [email protected] Link: https://lore.kernel.org/lkml/[email protected]/ Link: https://lore.kernel.org/r/[email protected]
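A hedged sketch of the fixed ordering (function names as used in the commit message; details may differ from the exact upstream diff):

  /* Sketch: migrate interrupts first, soft-disable the APIC last. */
  int native_cpu_disable(void)
  {
          int ret;

          ret = lapic_can_unplug_cpu();
          if (ret)
                  return ret;

          cpu_disable_common();   /* runs fixup_irqs() while the APIC is alive */

          /*
           * Disable the local APIC only after all interrupts have been
           * migrated away from this CPU, so nothing is lost from the IRR.
           */
          apic_soft_disable();

          return 0;
  }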
2020-08-26  x86/mce: Delay clearing IA32_MCG_STATUS to the end of do_machine_check()  (Tony Luck)  1 file, -5/+4
A long time ago, Linux cleared IA32_MCG_STATUS at the very end of machine check processing. Then, some fancy recovery and IST manipulation was added in: d4812e169de4 ("x86, mce: Get rid of TIF_MCE_NOTIFY and associated mce tricks") and clearing IA32_MCG_STATUS was pulled earlier in the function. Next change moved the actual recovery out of do_machine_check() and just used task_work_add() to schedule it later (before returning to the user): 5567d11c21a1 ("x86/mce: Send #MC singal from task work") Most recently the fancy IST footwork was removed as no longer needed: b052df3da821 ("x86/entry: Get rid of ist_begin/end_non_atomic()") At this point there is no reason remaining to clear IA32_MCG_STATUS early. It can move back to the very end of the function. Also move sync_core(). The comments for this function say that it should only be called when instructions have been changed/re-mapped. Recovery for an instruction fetch may change the physical address. But that doesn't happen until the scheduled work runs (which could be on another CPU). [ bp: Massage commit message. ] Reported-by: Gabriele Paoloni <[email protected]> Signed-off-by: Tony Luck <[email protected]> Signed-off-by: Borislav Petkov <[email protected]> Link: https://lkml.kernel.org/r/[email protected]