path: root/arch/x86
2020-09-04  tracing/kprobes, x86/ptrace: Fix regs argument order for i386  (Vamshi K Sthambamkadi, 1 file, -1/+1)
On i386, the first three function arguments are passed in the registers eax, edx, and ecx (per the regparm(3) calling convention). Change the mapping in regs_get_kernel_argument() so that arg1=ax, arg2=dx, and arg3=cx. With this change, the selftest testcase kprobes_args_use.tc passes. Fixes: 3c88ee194c28 ("x86: ptrace: Add function argument access API") Signed-off-by: Vamshi K Sthambamkadi <[email protected]> Signed-off-by: Borislav Petkov <[email protected]> Acked-by: Masami Hiramatsu <[email protected]> Acked-by: Peter Zijlstra (Intel) <[email protected]> Cc: <[email protected]> Link: https://lkml.kernel.org/r/20200828113242.GA1424@cosmos
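A minimal sketch of the corrected i386 register-to-argument mapping; the real regs_get_kernel_argument() in arch/x86/include/asm/ptrace.h also falls back to stack arguments, which is omitted here, so treat this as illustrative only:

  /* i386 regparm(3): arg1 -> eax, arg2 -> edx, arg3 -> ecx */
  static const unsigned int argument_offs[] = {
          offsetof(struct pt_regs, ax),
          offsetof(struct pt_regs, dx),
          offsetof(struct pt_regs, cx),
  };

  static unsigned long regs_get_kernel_argument(struct pt_regs *regs,
                                                unsigned int n)
  {
          if (n >= ARRAY_SIZE(argument_offs))
                  return 0;       /* stack-argument handling omitted in this sketch */
          return *(unsigned long *)((unsigned long)regs + argument_offs[n]);
  }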
2020-09-04  arm64: mte: Add specific SIGSEGV codes  (Vincenzo Frascino, 1 file, -1/+1)
Add MTE-specific SIGSEGV codes to siginfo.h and update the x86 BUILD_BUG_ON(NSIGSEGV != 7) compile check. Signed-off-by: Vincenzo Frascino <[email protected]> [[email protected]: renamed precise/imprecise to sync/async] [[email protected]: dropped #ifdef __aarch64__, renumbered] Signed-off-by: Catalin Marinas <[email protected]> Acked-by: "Eric W. Biederman" <[email protected]> Cc: Arnd Bergmann <[email protected]> Cc: Will Deacon <[email protected]>
2020-09-04  x86/entry/64: Do not include inst.h in calling.h  (Uros Bizjak, 1 file, -1/+0)
inst.h was included in calling.h solely to instantiate the RDPID macro. The usage of RDPID was removed in 6a3ea3e68b8a ("x86/entry/64: Do not use RDPID in paranoid entry to accomodate KVM") so remove the include. Fixes: 6a3ea3e68b8a ("x86/entry/64: Do not use RDPID in paranoid entry to accomodate KVM") Signed-off-by: Uros Bizjak <[email protected]> Signed-off-by: Borislav Petkov <[email protected]> Link: https://lkml.kernel.org/r/[email protected]
2020-09-04  x86, fakenuma: Fix invalid starting node ID  (Huang Ying, 1 file, -1/+1)
Commit cc9aec03e58f ("x86/numa_emulation: Introduce uniform split capability") uses "-1" as the starting node ID, which causes a strange kernel log when "numa=fake=32G" is added to the kernel command line:

  Faking node -1 at [mem 0x0000000000000000-0x0000000893ffffff] (35136MB)
  Faking node 0 at [mem 0x0000001840000000-0x000000203fffffff] (32768MB)
  Faking node 1 at [mem 0x0000000894000000-0x000000183fffffff] (64192MB)
  Faking node 2 at [mem 0x0000002040000000-0x000000283fffffff] (32768MB)
  Faking node 3 at [mem 0x0000002840000000-0x000000303fffffff] (32768MB)

and finally the kernel crashes:

  BUG: Bad page state in process swapper pfn:00011
  page:(____ptrval____) refcount:0 mapcount:1 mapping:(____ptrval____) index:0x55cd7e44b270 pfn:0x11
  failed to read mapping contents, not a valid kernel address?
  flags: 0x5(locked|uptodate)
  raw: 0000000000000005 000055cd7e44af30 000055cd7e44af50 0000000100000006
  raw: 000055cd7e44b270 000055cd7e44b290 0000000000000000 000055cd7e44b510
  page dumped because: page still charged to cgroup
  page->mem_cgroup:000055cd7e44b510
  Modules linked in:
  CPU: 0 PID: 0 Comm: swapper Not tainted 5.9.0-rc2 #1
  Hardware name: Intel Corporation S2600WFT/S2600WFT, BIOS SE5C620.86B.02.01.0008.031920191559 03/19/2019
  Call Trace:
   dump_stack+0x57/0x80
   bad_page.cold+0x63/0x94
   __free_pages_ok+0x33f/0x360
   memblock_free_all+0x127/0x195
   mem_init+0x23/0x1f5
   start_kernel+0x219/0x4f5
   secondary_startup_64+0xb6/0xc0

Fix this bug by using 0 as the starting node ID. This restores the original behavior before cc9aec03e58f. [ mingo: Massaged the changelog. ] Fixes: cc9aec03e58f ("x86/numa_emulation: Introduce uniform split capability") Signed-off-by: "Huang, Ying" <[email protected]> Signed-off-by: Ingo Molnar <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2020-09-03  x86/uaccess: Use XORL %0,%0 in __get_user_asm()  (Uros Bizjak, 1 file, -1/+1)
XORL %0,%0 is equivalent to XORQ %0,%0 as both will zero the entire register. Use XORL %0,%0 for all operand sizes to avoid a REX prefix byte when legacy registers are used and an operand-size prefix byte when 16-bit registers are used. Zeroing the full register is OK in this use case. As a result, the size of the .fixup section decreases by 20 bytes. [ bp: Massage commit message. ] Signed-off-by: Uros Bizjak <[email protected]> Signed-off-by: Borislav Petkov <[email protected]> Reviewed-by: H. Peter Anvin (Intel) <[email protected]> Link: https://lkml.kernel.org/r/[email protected]
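A standalone sketch of the encoding difference for GCC/Clang on x86-64; the byte sequences in the comments are the standard encodings, not taken from the patch:

  static inline void zero_rax_variants(void)
  {
          asm volatile("xorq %%rax, %%rax" ::: "rax"); /* 48 31 c0 - REX.W prefix, clears RAX          */
          asm volatile("xorl %%eax, %%eax" ::: "rax"); /* 31 c0    - shortest, still clears all of RAX */
          asm volatile("xorw %%ax, %%ax"   ::: "rax"); /* 66 31 c0 - size prefix, clears only AX       */
  }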
2020-09-03  dma-mapping: introduce dma_get_seg_boundary_nr_pages()  (Nicolin Chen, 1 file, -2/+1)
We found that callers of dma_get_seg_boundary() mostly do an ALIGN with the page mask and then a page shift to get the number of pages:

  ALIGN(boundary + 1, 1 << shift) >> shift

However, the boundary might be as large as ULONG_MAX, which means that a device has no specific boundary limit. So either the "+ 1" or passing it to ALIGN() would potentially overflow. According to the kernel defines:

  #define ALIGN_MASK(x, mask) (((x) + (mask)) & ~(mask))
  #define ALIGN(x, a)         ALIGN_MASK(x, (typeof(x))(a) - 1)

we can simplify the logic here into a helper function doing:

  ALIGN(boundary + 1, 1 << shift) >> shift
    = ALIGN_MASK(b + 1, (1 << s) - 1) >> s
    = {[b + 1 + (1 << s) - 1] & ~[(1 << s) - 1]} >> s
    = [b + 1 + (1 << s) - 1] >> s
    = [b + (1 << s)] >> s
    = (b >> s) + 1

This patch introduces and applies dma_get_seg_boundary_nr_pages() as an overflow-free helper for the dma_get_seg_boundary() callers to get the number of pages. It also takes care of the NULL dev case for non-DMA API callers. Suggested-by: Christoph Hellwig <[email protected]> Signed-off-by: Nicolin Chen <[email protected]> Acked-by: Niklas Schnelle <[email protected]> Acked-by: Michael Ellerman <[email protected]> (powerpc) Signed-off-by: Christoph Hellwig <[email protected]>
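A sketch of what such a helper can look like; the dev == NULL default below is an assumption for non-DMA-API callers, so check include/linux/dma-mapping.h for the actual definition:

  static inline unsigned long dma_get_seg_boundary_nr_pages(struct device *dev,
                                                            unsigned int page_shift)
  {
          if (!dev)
                  return (U32_MAX >> page_shift) + 1;     /* assumed 32-bit default boundary */
          return (dma_get_seg_boundary(dev) >> page_shift) + 1;
  }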
2020-09-03  x86/mm/32: Bring back vmalloc faulting on x86_32  (Joerg Roedel, 1 file, -0/+78)
One can not simply remove vmalloc faulting on x86-32. Upstream commit 7f0a002b5a21 ("x86/mm: remove vmalloc faulting") removed it on x86 altogether because the arch_sync_kernel_mappings() interface had been introduced earlier. This interface added synchronization of vmalloc/ioremap page-table updates to all page-tables in the system at creation time and was thought to make vmalloc faulting obsolete. But that assumption was incredibly naive. It turned out that there is a race window between the time the vmalloc or ioremap code establishes a mapping and the time it synchronizes this change to other page-tables in the system. During this race window another CPU or thread can establish a vmalloc mapping which uses the same intermediate page-table entries (e.g. PMD or PUD) and does no synchronization in the end, because it found all necessary mappings already present in the kernel reference page-table. But when these intermediate page-table entries are not yet synchronized, the other CPU or thread will continue with a vmalloc address that is not yet mapped in the page-table it currently uses, causing an unhandled page fault and oops like below:

  BUG: unable to handle page fault for address: fe80c000
  #PF: supervisor write access in kernel mode
  #PF: error_code(0x0002) - not-present page
  *pde = 33183067 *pte = a8648163
  Oops: 0002 [#1] SMP
  CPU: 1 PID: 13514 Comm: cve-2017-17053 Tainted: G ...
  Call Trace:
   ldt_dup_context+0x66/0x80
   dup_mm+0x2b3/0x480
   copy_process+0x133b/0x15c0
   _do_fork+0x94/0x3e0
   __ia32_sys_clone+0x67/0x80
   __do_fast_syscall_32+0x3f/0x70
   do_fast_syscall_32+0x29/0x60
   do_SYSENTER_32+0x15/0x20
   entry_SYSENTER_32+0x9f/0xf2
  EIP: 0xb7eef549

So the arch_sync_kernel_mappings() interface is racy, but removing it would mean re-introducing the vmalloc_sync_all() interface, which is even more awful. Keep arch_sync_kernel_mappings() in place and catch the race condition in the page-fault handler instead. Do a partial revert of the above commit to get vmalloc faulting on x86-32 back in place. Fixes: 7f0a002b5a21 ("x86/mm: remove vmalloc faulting") Reported-by: Naresh Kamboju <[email protected]> Signed-off-by: Joerg Roedel <[email protected]> Signed-off-by: Ingo Molnar <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2020-09-03  x86/cmdline: Disable jump tables for cmdline.c  (Arvind Sankar, 1 file, -1/+1)
When CONFIG_RETPOLINE is disabled, Clang uses a jump table for the switch statement in cmdline_find_option (jump tables are disabled when CONFIG_RETPOLINE is enabled). This function is called very early in boot from sme_enable() if CONFIG_AMD_MEM_ENCRYPT is enabled. At this time, the kernel is still executing out of the identity mapping, but the jump table will contain virtual addresses. Fix this by disabling jump tables for cmdline.c when AMD_MEM_ENCRYPT is enabled. Signed-off-by: Arvind Sankar <[email protected]> Signed-off-by: Ingo Molnar <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2020-09-03  x86/boot/compressed: Warn on orphan section placement  (Kees Cook, 1 file, -0/+1)
We don't want to depend on the linker's orphan section placement heuristics as these can vary between linkers, and may change between versions. All sections need to be explicitly handled in the linker script. Now that all sections are explicitly handled, enable orphan section warnings. Signed-off-by: Kees Cook <[email protected]> Signed-off-by: Ingo Molnar <[email protected]> Reviewed-by: Nick Desaulniers <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2020-09-03  x86/build: Warn on orphan section placement  (Kees Cook, 1 file, -0/+4)
We don't want to depend on the linker's orphan section placement heuristics as these can vary between linkers, and may change between versions. All sections need to be explicitly handled in the linker script. Now that all sections are explicitly handled, enable orphan section warnings. Signed-off-by: Kees Cook <[email protected]> Signed-off-by: Ingo Molnar <[email protected]> Reviewed-by: Nick Desaulniers <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2020-09-01  Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next  (David S. Miller, 1 file, -11/+21)
Daniel Borkmann says:

====================
pull-request: bpf-next 2020-09-01

The following pull-request contains BPF updates for your *net-next* tree. There are two small conflicts when pulling, resolve as follows:

1) Merge conflict in tools/lib/bpf/libbpf.c between 88a82120282b ("libbpf: Factor out common ELF operations and improve logging") in bpf-next and 1e891e513e16 ("libbpf: Fix map index used in error message") in net-next. Resolve by taking the hunk in bpf-next:

  [...]
  scn = elf_sec_by_idx(obj, obj->efile.btf_maps_shndx);
  data = elf_sec_data(obj, scn);
  if (!scn || !data) {
          pr_warn("elf: failed to get %s map definitions for %s\n",
                  MAPS_ELF_SEC, obj->path);
          return -EINVAL;
  }
  [...]

2) Merge conflict in drivers/net/ethernet/mellanox/mlx5/core/en/xsk/rx.c between 9647c57b11e5 ("xsk: i40e: ice: ixgbe: mlx5: Test for dma_need_sync earlier for better performance") in bpf-next and e20f0dbf204f ("net/mlx5e: RX, Add a prefetch command for small L1_CACHE_BYTES") in net-next. Resolve the two locations by retaining net_prefetch() and taking xsk_buff_dma_sync_for_cpu() from bpf-next. Should look like:

  [...]
  xdp_set_data_meta_invalid(xdp);
  xsk_buff_dma_sync_for_cpu(xdp, rq->xsk_pool);
  net_prefetch(xdp->data);
  [...]

We've added 133 non-merge commits during the last 14 day(s) which contain a total of 246 files changed, 13832 insertions(+), 3105 deletions(-).

The main changes are:

1) Initial support for sleepable BPF programs along with bpf_copy_from_user() helper for tracing to reliably access user memory, from Alexei Starovoitov.
2) Add BPF infra for writing and parsing TCP header options, from Martin KaFai Lau.
3) bpf_d_path() helper for returning full path for given 'struct path', from Jiri Olsa.
4) AF_XDP support for shared umems between devices and queues, from Magnus Karlsson.
5) Initial prep work for full BPF-to-BPF call support in libbpf, from Andrii Nakryiko.
6) Generalize bpf_sk_storage map & add local storage for inodes, from KP Singh.
7) Implement sockmap/hash updates from BPF context, from Lorenz Bauer.
8) BPF xor verification for scalar types & add BPF link iterator, from Yonghong Song.
9) Use target's prog type for BPF_PROG_TYPE_EXT prog verification, from Udip Pant.
10) Rework BPF tracing samples to use libbpf loader, from Daniel T. Lee.
11) Fix xdpsock sample to really cycle through all buffers, from Weqaar Janjua.
12) Improve type safety for tun/veth XDP frame handling, from Maciej Żenczykowski.
13) Various smaller cleanups and improvements all over the place.
====================

Signed-off-by: David S. Miller <[email protected]>
2020-09-01  x86/PCI: Fix intel_mid_pci.c build error when ACPI is not enabled  (Randy Dunlap, 1 file, -0/+1)
Fix build error when CONFIG_ACPI is not set/enabled by adding the header file <asm/acpi.h>, which contains a stub for the function in the build error:

  ../arch/x86/pci/intel_mid_pci.c: In function ‘intel_mid_pci_init’:
  ../arch/x86/pci/intel_mid_pci.c:303:2: error: implicit declaration of function ‘acpi_noirq_set’; did you mean ‘acpi_irq_get’? [-Werror=implicit-function-declaration]
    acpi_noirq_set();

Fixes: a912a7584ec3 ("x86/platform/intel-mid: Move PCI initialization to arch_init()") Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Randy Dunlap <[email protected]> Signed-off-by: Bjorn Helgaas <[email protected]> Reviewed-by: Andy Shevchenko <[email protected]> Reviewed-by: Jesse Barnes <[email protected]> Acked-by: Thomas Gleixner <[email protected]> Cc: [email protected] # v4.16+ Cc: Jacob Pan <[email protected]> Cc: Len Brown <[email protected]> Cc: Jesse Barnes <[email protected]> Cc: Arjan van de Ven <[email protected]>
2020-09-01  x86/boot/compressed: Add missing debugging sections to output  (Kees Cook, 1 file, -0/+2)
Include the missing DWARF and STABS sections in the compressed image, when they are present. Signed-off-by: Kees Cook <[email protected]> Signed-off-by: Ingo Molnar <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2020-09-01  x86/boot/compressed: Remove, discard, or assert for unwanted sections  (Kees Cook, 2 files, -2/+13)
In preparation for warning on orphan sections, stop the linker from generating the .eh_frame* sections, discard unwanted non-zero-sized generated sections, and enforce other expected-to-be-zero-sized sections (since discarding them might hide problems with them suddenly gaining unexpected entries). Signed-off-by: Kees Cook <[email protected]> Signed-off-by: Ingo Molnar <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2020-09-01  x86/boot/compressed: Reorganize zero-size section asserts  (Kees Cook, 1 file, -18/+26)
For readability, move the zero-sized sections to the end after DISCARDS. Signed-off-by: Kees Cook <[email protected]> Signed-off-by: Ingo Molnar <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2020-09-01  x86/build: Add asserts for unwanted sections  (Kees Cook, 1 file, -0/+24)
In preparation for warning on orphan sections, enforce other expected-to-be-zero-sized sections (since discarding them might hide problems with them suddenly gaining unexpected entries). Signed-off-by: Kees Cook <[email protected]> Signed-off-by: Ingo Molnar <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2020-09-01  x86/build: Enforce an empty .got.plt section  (Kees Cook, 1 file, -1/+13)
The .got.plt section should always be zero (or filled only with the linker-generated lazy dispatch entry). Enforce this with an assert and mark the section as INFO. This is more sensitive than just blindly discarding the section. Signed-off-by: Kees Cook <[email protected]> Signed-off-by: Ingo Molnar <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2020-09-01  x86/asm: Avoid generating unused kprobe sections  (Kees Cook, 1 file, -1/+5)
When !CONFIG_KPROBES, do not generate kprobe sections. This makes sure there are no unexpected sections encountered by the linker scripts. Signed-off-by: Kees Cook <[email protected]> Signed-off-by: Ingo Molnar <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2020-09-01  x86/perf, static_call: Optimize x86_pmu methods  (Peter Zijlstra, 1 file, -40/+94)
Replace many of the indirect calls with static_call(). The average PMI time, as measured by perf_sample_event_took()*:

  PRE:  3283.03 [ns]
  POST: 3145.12 [ns]

Which is a ~138 [ns] win per PMI, or a ~4.2% decrease. [*] on an IVB-EP, using: 'perf record -a -e cycles -- make O=defconfig-build/ -j80' Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Signed-off-by: Ingo Molnar <[email protected]> Cc: Linus Torvalds <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2020-09-01  static_call: Allow early init  (Peter Zijlstra, 2 files, -1/+6)
In order to use static_call() to wire up x86_pmu, we need to initialize earlier, specifically before memory allocation works; copy some of the tricks from jump_label to enable this. Primarily we overload key->next to store a sites pointer when there are no modules; this avoids having to use kmalloc() to initialize the sites and allows us to run much earlier. Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Signed-off-by: Ingo Molnar <[email protected]> Reviewed-by: Steven Rostedt (VMware) <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2020-09-01  static_call: Add some validation  (Peter Zijlstra, 1 file, -2/+26)
Verify the text we're about to change is as we expect it to be. Requested-by: Steven Rostedt <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Signed-off-by: Ingo Molnar <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2020-09-01  static_call: Handle tail-calls  (Peter Zijlstra, 1 file, -3/+18)
GCC can turn our static_call(name)(args...) into a tail call, in which case we get a JMP.d32 into the trampoline (which then does a further tail-call). Teach objtool to recognise and mark these in .static_call_sites and adjust the code patching to deal with this. Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Signed-off-by: Ingo Molnar <[email protected]> Cc: Linus Torvalds <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2020-09-01  static_call: Add static_call_cond()  (Peter Zijlstra, 2 files, -13/+41)
Extend the static_call infrastructure to optimize the following common pattern:

  if (func_ptr)
          func_ptr(args...);

For the trampoline (which is in effect a tail-call), we patch the JMP.d32 into a RET, which then directly consumes the trampoline call. For the in-line sites we replace the CALL with a NOP5. NOTE: this is 'obviously' limited to functions with a 'void' return type. NOTE: DEFINE_STATIC_COND_CALL() only requires a typename, as opposed to a full function. Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Signed-off-by: Ingo Molnar <[email protected]> Cc: Linus Torvalds <[email protected]> Link: https://lore.kernel.org/r/[email protected]
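A minimal usage sketch with an invented hook name; DEFINE_STATIC_COND_CALL() and static_call_cond() are the interfaces named in the commit message, and the declaration-only prototype stands in for the "typename" the macro requires:

  /* Type carrier only; never defined or called directly. */
  void my_notify_t(int event);

  DEFINE_STATIC_COND_CALL(my_notify, my_notify_t);

  void do_event(int event)
  {
          /* Old pattern:  if (notify_fn) notify_fn(event);
           * The call below is a NOP5 while no function is installed, and a
           * direct call after static_call_update(my_notify, &real_handler).
           */
          static_call_cond(my_notify)(event);
  }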
2020-09-01  x86/alternatives: Teach text_poke_bp() to emulate RET  (Peter Zijlstra, 2 files, -0/+24)
Future patches will need to poke a RET instruction, provide the infrastructure required for this. Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Signed-off-by: Ingo Molnar <[email protected]> Reviewed-by: Steven Rostedt (VMware) <[email protected]> Cc: Masami Hiramatsu <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2020-09-01  x86/static_call: Add inline static call implementation for x86-64  (Josh Poimboeuf, 4 files, -2/+18)
Add the inline static call implementation for x86-64. The generated code is identical to the out-of-line case, except we move the trampoline into its own section. Objtool uses the trampoline naming convention to detect all the call sites. It then annotates those call sites in the .static_call_sites section. During boot (and module init), the call sites are patched to call directly into the destination function. The temporary trampoline is then no longer used. [peterz: merged trampolines, put trampoline in section] Signed-off-by: Josh Poimboeuf <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Signed-off-by: Ingo Molnar <[email protected]> Cc: Linus Torvalds <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2020-09-01  x86/static_call: Add out-of-line static call implementation  (Josh Poimboeuf, 4 files, -0/+56)
Add the x86 out-of-line static call implementation. For each key, a permanent trampoline is created which is the destination for all static calls for the given key. The trampoline has a direct jump which gets patched by static_call_update() when the destination function changes. [peterz: fixed trampoline, rewrote patching code] Signed-off-by: Josh Poimboeuf <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Signed-off-by: Ingo Molnar <[email protected]> Cc: Linus Torvalds <[email protected]> Link: https://lore.kernel.org/r/[email protected]
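A short usage sketch of the out-of-line API with made-up function names; DEFINE_STATIC_CALL(), static_call() and static_call_update() are the interfaces this implementation backs:

  static int calc_default(int x)
  {
          return x;
  }

  DEFINE_STATIC_CALL(calc, calc_default);

  int compute(int x)
  {
          /* Direct call into the per-key trampoline; no indirect branch. */
          return static_call(calc)(x);
  }

  static int calc_fast(int x)
  {
          return x * 2;
  }

  void switch_to_fast(void)
  {
          /* Patches the trampoline's direct jump to the new destination. */
          static_call_update(calc, &calc_fast);
  }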
2020-09-01  static_call: Avoid kprobes on inline static_call()s  (Peter Zijlstra, 1 file, -1/+3)
Similar to how we disallow kprobes on any other dynamic text (ftrace/jump_label) also disallow kprobes on inline static_call()s. Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Signed-off-by: Ingo Molnar <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2020-09-01  vmlinux.lds.h: Split ELF_DETAILS from STABS_DEBUG  (Kees Cook, 2 files, -0/+3)
The .comment section doesn't belong in STABS_DEBUG. Split it out into a new macro named ELF_DETAILS. This will gain other non-debug sections that need to be accounted for when linking with --orphan-handling=warn. Signed-off-by: Kees Cook <[email protected]> Signed-off-by: Ingo Molnar <[email protected]> Cc: [email protected] Link: https://lore.kernel.org/r/[email protected]
2020-08-30  x86/kvm: Expose TSX Suspend Load Tracking feature  (Cathy Zhang, 1 file, -1/+1)
The TSX suspend load tracking instructions are supported by the Intel uarch Sapphire Rapids. They aim to give a way to choose which memory accesses do not need to be tracked in the TSX read set. Their availability is indicated as CPUID.(EAX=7,ECX=0):EDX[bit 16]. Expose the TSX Suspend Load Address Tracking feature in KVM CPUID, so KVM can pass this information to guests and they can make use of this feature accordingly. Signed-off-by: Cathy Zhang <[email protected]> Signed-off-by: Borislav Petkov <[email protected]> Reviewed-by: Tony Luck <[email protected]> Link: https://lkml.kernel.org/r/[email protected]
2020-08-30  Merge tag 'x86-urgent-2020-08-30' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip  (Linus Torvalds, 2 files, -13/+29)
Pull x86 fixes from Thomas Gleixner:
 "Three interrupt related fixes for X86:

   - Move disabling of the local APIC after invoking fixup_irqs() to ensure that interrupts which are incoming are noted in the IRR and not ignored.

   - Unbreak affinity setting. The rework of the entry code reused the regular exception entry code for device interrupts. The vector number is pushed into the errorcode slot on the stack which is then lifted into an argument and set to -1 because that's regs->orig_ax which is used in quite some places to check whether the entry came from a syscall. But it was overlooked that orig_ax is used in the affinity cleanup code to validate whether the interrupt has arrived on the new target. It turned out that this vector check is pointless because interrupts are never moved from one vector to another on the same CPU. That check is a historical leftover from the time where x86 supported multi-CPU affinities, but no longer needed with the now strict single CPU affinity. Famous last words ...

   - Add a missing check for an empty cpumask into the matrix allocator. The affinity change added a warning to catch the case where an interrupt is moved on the same CPU to a different vector. This triggers because a condition with an empty cpumask returns an assignment from the allocator as the allocator uses for_each_cpu() without checking the cpumask for being empty. The historical inconsistent for_each_cpu() behaviour of ignoring the cpumask and unconditionally claiming that CPU0 is in the mask struck again. Sigh.

  plus a new entry into the MAINTAINERS file for the HPE/UV platform"

* tag 'x86-urgent-2020-08-30' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  genirq/matrix: Deal with the sillyness of for_each_cpu() on UP
  x86/irq: Unbreak interrupt affinity setting
  x86/hotplug: Silence APIC only after all interrupts are migrated
  MAINTAINERS: Add entry for HPE Superdome Flex (UV) maintainers
2020-08-30  Merge tag 'locking-urgent-2020-08-30' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip  (Linus Torvalds, 4 files, -20/+3)
Pull locking fixes from Thomas Gleixner:
 "A set of fixes for lockdep, tracing and RCU:

   - Prevent recursion by using raw_cpu_* operations

   - Fixup the interrupt state in the cpu idle code to be consistent

   - Push rcu_idle_enter/exit() invocations deeper into the idle path so that the lock operations are inside the RCU watching sections

   - Move trace_cpu_idle() into generic code so it's called before RCU goes idle

   - Handle raw_local_irq* vs. local_irq* operations correctly

   - Move the tracepoints out from under the lockdep recursion handling which turned out to be fragile and inconsistent"

* tag 'locking-urgent-2020-08-30' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  lockdep,trace: Expose tracepoints
  lockdep: Only trace IRQ edges
  mips: Implement arch_irqs_disabled()
  arm64: Implement arch_irqs_disabled()
  nds32: Implement arch_irqs_disabled()
  locking/lockdep: Cleanup
  x86/entry: Remove unused THUNKs
  cpuidle: Move trace_cpu_idle() into generic code
  cpuidle: Make CPUIDLE_FLAG_TLB_FLUSHED generic
  sched,idle,rcu: Push rcu_idle deeper into the idle path
  cpuidle: Fixup IRQ state
  lockdep: Use raw_cpu_*() for per-cpu variables
2020-08-30  x86/cpufeatures: Enumerate TSX suspend load address tracking instructions  (Kyung Min Park, 1 file, -0/+1)
Intel TSX suspend load tracking instructions aim to give a way to choose which memory accesses do not need to be tracked in the TSX read set. Add TSX suspend load tracking CPUID feature flag TSXLDTRK for enumeration. A processor supports Intel TSX suspend load address tracking if CPUID.0x07.0x0:EDX[16] is present. Two instructions XSUSLDTRK, XRESLDTRK are available when this feature is present. The CPU feature flag is shown as "tsxldtrk" in /proc/cpuinfo. Signed-off-by: Kyung Min Park <[email protected]> Signed-off-by: Cathy Zhang <[email protected]> Signed-off-by: Borislav Petkov <[email protected]> Reviewed-by: Tony Luck <[email protected]> Link: https://lkml.kernel.org/r/[email protected]
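A user-space probe for the enumeration bit described above (CPUID.0x07.0x0:EDX[16]); a small sketch assuming a GCC/Clang build on x86:

  #include <cpuid.h>
  #include <stdbool.h>

  static bool has_tsxldtrk(void)
  {
          unsigned int eax, ebx, ecx, edx;

          /* Leaf 7, subleaf 0; EDX bit 16 enumerates TSXLDTRK. */
          if (!__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx))
                  return false;
          return edx & (1u << 16);
  }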
2020-08-28  bpf: Introduce sleepable BPF programs  (Alexei Starovoitov, 1 file, -11/+21)
Introduce sleepable BPF programs that can request such property for themselves via the BPF_F_SLEEPABLE flag at program load time. In such a case they will be able to use helpers like bpf_copy_from_user() that might sleep. At present only fentry/fexit/fmod_ret and lsm programs can request to be sleepable and only when they are attached to kernel functions that are known to allow sleeping.

The non-sleepable programs rely on implicit rcu_read_lock() and migrate_disable() to protect the lifetime of programs, of the maps that they use, and of the per-cpu kernel structures used to pass info between bpf programs and the kernel. The sleepable programs cannot be enclosed in rcu_read_lock(). migrate_disable() maps to preempt_disable() in non-RT kernels, so the progs should not be enclosed in migrate_disable() either. Therefore rcu_read_lock_trace is used to protect the lifetime of sleepable progs.

There are many networking and tracing program types. In many cases the 'struct bpf_prog *' pointer itself is rcu protected within some other kernel data structure and the kernel code is using rcu_dereference() to load that program pointer and call BPF_PROG_RUN() on it. All these cases are not touched. Instead sleepable bpf programs are allowed with bpf trampoline only. The program pointers are hard-coded into the generated assembly of the bpf trampoline and synchronize_rcu_tasks_trace() is used to protect the lifetime of the program. The same trampoline can hold both sleepable and non-sleepable progs.

When rcu_read_lock_trace is held it means that some sleepable bpf program is running from the bpf trampoline. Those programs can use bpf arrays and preallocated hash/lru maps. These map types wait on programs to complete via synchronize_rcu_tasks_trace(). Updates to the trampoline now have to do synchronize_rcu_tasks_trace() and synchronize_rcu_tasks() to wait for sleepable progs to finish and for trampoline assembly to finish.

This is the first step of introducing sleepable progs. Eventually dynamically allocated hash maps can be allowed and networking program types can become sleepable too. Signed-off-by: Alexei Starovoitov <[email protected]> Signed-off-by: Daniel Borkmann <[email protected]> Reviewed-by: Josef Bacik <[email protected]> Acked-by: Andrii Nakryiko <[email protected]> Acked-by: KP Singh <[email protected]> Link: https://lore.kernel.org/bpf/[email protected]
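A minimal sketch of a sleepable tracing program using libbpf conventions: the "fentry.s/" section prefix marks the program as BPF_F_SLEEPABLE, and bpf_copy_from_user() is the helper named above. The attach point and buffer size are illustrative choices, not taken from the patch:

  #include "vmlinux.h"
  #include <bpf/bpf_helpers.h>
  #include <bpf/bpf_tracing.h>

  char LICENSE[] SEC("license") = "GPL";

  /* Sleepable fentry program: may fault in user pages via bpf_copy_from_user(). */
  SEC("fentry.s/do_sys_openat2")
  int BPF_PROG(trace_open, int dfd, const char *filename)
  {
          char buf[64] = {};

          bpf_copy_from_user(buf, sizeof(buf), filename);
          bpf_printk("openat2: %s", buf);
          return 0;
  }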
2020-08-27  x86/mpparse: Remove duplicate io_apic.h include  (Wang Hai, 1 file, -1/+0)
Remove asm/io_apic.h which is included more than once. Reported-by: Hulk Robot <[email protected]> Signed-off-by: Wang Hai <[email protected]> Signed-off-by: Borislav Petkov <[email protected]> Reviewed-by: Pekka Enberg <[email protected]> Link: https://lkml.kernel.org/r/[email protected]
2020-08-27  x86/irq: Unbreak interrupt affinity setting  (Thomas Gleixner, 1 file, -7/+9)
Several people reported that 5.8 broke the interrupt affinity setting mechanism. The consolidation of the entry code reused the regular exception entry code for device interrupts and changed the way the vector number is conveyed from ptregs->orig_ax to a function argument. The low level entry uses the hardware error code slot to push the vector number onto the stack, which is retrieved from there into a function argument, and the slot on the stack is set to -1.

The reason for setting it to -1 is that the error code slot is at the position where pt_regs::orig_ax is. A positive value in pt_regs::orig_ax indicates that the entry came via a syscall. If it's not set to a negative value then a signal delivery on return to userspace would try to restart a syscall. But there are other places which rely on pt_regs::orig_ax being a valid indicator for syscall entry.

But setting pt_regs::orig_ax to -1 has a nasty side effect vs. the interrupt affinity setting mechanism, which was overlooked when this change was made. Moving interrupts on x86 happens in several steps. A new vector on a different CPU is allocated and the relevant interrupt source is reprogrammed to that. But that's racy and there might be an interrupt already in flight to the old vector. So the old vector is preserved until the first interrupt arrives on the new vector and the new target CPU. Once that happens the old vector is cleaned up, but this cleanup still depends on the vector number being stored in pt_regs::orig_ax, which is now -1. That -1 makes the cleanup check

  pt_regs::orig_ax == new_vector

always false. As a consequence the interrupt is moved once, but then it cannot be moved anymore because the cleanup of the old vector never happens.

There would be several ways to convey the vector information to that place in the guts of the interrupt handling, but on deeper inspection it turned out that this check is pointless and a leftover from the old affinity model of x86 which supported multi-CPU affinities. Under this model it was possible that an interrupt had an old and a new vector on the same CPU, so the vector match was required. Under the new model the effective affinity of an interrupt is always a single CPU from the requested affinity mask. If the affinity mask changes then either the interrupt stays on the CPU and on the same vector when that CPU is still in the new affinity mask, or it is moved to a different CPU, but it is never moved to a different vector on the same CPU.

Ergo the cleanup check for the matching vector number is not required and can be removed, which makes the dependency on pt_regs::orig_ax go away. The remaining check for new_cpu == smp_processor_id() is completely sufficient. If it matches then the interrupt was successfully migrated and the cleanup can proceed. For paranoia's sake add a warning into the vector assignment code to validate that the assumption of never moving to a different vector on the same CPU holds.

Fixes: 633260fa143 ("x86/irq: Convey vector as argument and not in ptregs") Reported-by: Alex bykov <[email protected]> Reported-by: Avi Kivity <[email protected]> Reported-by: Alexander Graf <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Tested-by: Alexander Graf <[email protected]> Cc: [email protected] Link: https://lore.kernel.org/r/[email protected]
2020-08-27  x86/hotplug: Silence APIC only after all interrupts are migrated  (Ashok Raj, 1 file, -6/+20)
There is a race when taking a CPU offline. Current code looks like this:

  native_cpu_disable()
  {
          ...
          apic_soft_disable();
          /*
           * Any existing set bits for pending interrupt to
           * this CPU are preserved and will be sent via IPI
           * to another CPU by fixup_irqs().
           */
          cpu_disable_common();
          {
                  ....
                  /*
                   * Race window happens here. Once local APIC has been
                   * disabled any new interrupts from the device to
                   * the old CPU are lost
                   */
                  fixup_irqs(); // Too late to capture anything in IRR.
                  ...
          }
  }

The fix is to disable the APIC *after* cpu_disable_common(). Testing was done with a USB NIC that provided a source of frequent interrupts. A script migrated interrupts to a specific CPU and then took that CPU offline. Fixes: 60dcaad5736f ("x86/hotplug: Silence APIC and NMI when CPU is dead") Reported-by: Evan Green <[email protected]> Signed-off-by: Ashok Raj <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Tested-by: Mathias Nyman <[email protected]> Tested-by: Evan Green <[email protected]> Reviewed-by: Evan Green <[email protected]> Cc: [email protected] Link: https://lore.kernel.org/lkml/[email protected]/ Link: https://lore.kernel.org/r/[email protected]
2020-08-26  x86/mce: Delay clearing IA32_MCG_STATUS to the end of do_machine_check()  (Tony Luck, 1 file, -5/+4)
A long time ago, Linux cleared IA32_MCG_STATUS at the very end of machine check processing. Then, some fancy recovery and IST manipulation was added in: d4812e169de4 ("x86, mce: Get rid of TIF_MCE_NOTIFY and associated mce tricks") and clearing IA32_MCG_STATUS was pulled earlier in the function. Next change moved the actual recovery out of do_machine_check() and just used task_work_add() to schedule it later (before returning to the user): 5567d11c21a1 ("x86/mce: Send #MC singal from task work") Most recently the fancy IST footwork was removed as no longer needed: b052df3da821 ("x86/entry: Get rid of ist_begin/end_non_atomic()") At this point there is no reason remaining to clear IA32_MCG_STATUS early. It can move back to the very end of the function. Also move sync_core(). The comments for this function say that it should only be called when instructions have been changed/re-mapped. Recovery for an instruction fetch may change the physical address. But that doesn't happen until the scheduled work runs (which could be on another CPU). [ bp: Massage commit message. ] Reported-by: Gabriele Paoloni <[email protected]> Signed-off-by: Tony Luck <[email protected]> Signed-off-by: Borislav Petkov <[email protected]> Link: https://lkml.kernel.org/r/[email protected]
2020-08-26  x86/resctrl: Enable user to view thread or core throttling mode  (Fenghua Yu, 3 files, -7/+87)
Early Intel hardware implementations of Memory Bandwidth Allocation (MBA) could only control bandwidth at the processor core level. This meant that when two processes with different bandwidth allocations ran simultaneously on the same core the hardware had to resolve this difference. It did so by applying the higher throttling value (lower bandwidth) to both processes. Newer implementations can apply different throttling values to each thread on a core. Introduce a new resctrl file, "thread_throttle_mode", on Intel systems that shows to the user how throttling values are allocated, per-core or per-thread. On systems that support per-core throttling, the file will display "max". On newer systems that support per-thread throttling, the file will display "per-thread". AMD confirmed in [1] that AMD bandwidth allocation is already at thread level but that the AMD implementation does not use a memory delay throttle mode. So to avoid confusion the thread throttling mode would be UNDEFINED on AMD systems and the "thread_throttle_mode" file will not be visible. Originally-by: Reinette Chatre <[email protected]> Signed-off-by: Fenghua Yu <[email protected]> Signed-off-by: Borislav Petkov <[email protected]> Reviewed-by: Reinette Chatre <[email protected]> Link: https://lkml.kernel.org/r/[email protected] Link: [1] https://lore.kernel.org/lkml/[email protected]
2020-08-26  x86/resctrl: Enumerate per-thread MBA controls  (Fenghua Yu, 3 files, -0/+3)
Some systems support per-thread Memory Bandwidth Allocation (MBA) which applies a throttling delay value to each hardware thread instead of to a core. Per-thread MBA is enumerated by CPUID. No feature flag is shown in /proc/cpuinfo. User applications need to check a resctrl throttling mode info file to know if the feature is supported. Signed-off-by: Fenghua Yu <[email protected]> Signed-off-by: Borislav Petkov <[email protected]> Reviewed-by: Babu Moger <[email protected]> Reviewed-by: Reinette Chatre <[email protected]> Link: https://lkml.kernel.org/r/[email protected]
2020-08-26  x86/entry: Remove unused THUNKs  (Peter Zijlstra, 1 file, -5/+0)
Unused remnants Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Reviewed-by: Steven Rostedt (VMware) <[email protected]> Reviewed-by: Thomas Gleixner <[email protected]> Acked-by: Rafael J. Wysocki <[email protected]> Tested-by: Marco Elver <[email protected]> Link: https://lkml.kernel.org/r/[email protected]
2020-08-26  cpuidle: Move trace_cpu_idle() into generic code  (Peter Zijlstra, 1 file, -4/+0)
Remove trace_cpu_idle() from the arch_cpu_idle() implementations and put it in the generic code, right before disabling RCU. Gets rid of more trace_*_rcuidle() users. Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Reviewed-by: Steven Rostedt (VMware) <[email protected]> Reviewed-by: Thomas Gleixner <[email protected]> Acked-by: Rafael J. Wysocki <[email protected]> Tested-by: Marco Elver <[email protected]> Link: https://lkml.kernel.org/r/[email protected]
2020-08-26  cpuidle: Make CPUIDLE_FLAG_TLB_FLUSHED generic  (Peter Zijlstra, 2 files, -11/+3)
This allows moving the leave_mm() call into generic code before rcu_idle_enter(). Gets rid of more trace_*_rcuidle() users. Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Reviewed-by: Steven Rostedt (VMware) <[email protected]> Reviewed-by: Thomas Gleixner <[email protected]> Acked-by: Rafael J. Wysocki <[email protected]> Tested-by: Marco Elver <[email protected]> Link: https://lkml.kernel.org/r/[email protected]
2020-08-24  kvm: mmu: page_track: Fix RCU list API usage  (Madhuparna Bhowmik, 1 file, -2/+4)
Use hlist_for_each_entry_srcu() instead of hlist_for_each_entry_rcu() as it also checks whether the right lock is held. Using hlist_for_each_entry_rcu() with a condition argument will not report the cases where an SRCU protected list is traversed using rcu_read_lock(). Hence, use hlist_for_each_entry_srcu(). Signed-off-by: Madhuparna Bhowmik <[email protected]> Signed-off-by: Paul E. McKenney <[email protected]> Acked-by: Paolo Bonzini <[email protected]> Cc: <[email protected]>
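A sketch of the pattern with an invented notifier structure and lock names; both iterators live in include/linux/rculist.h and take the same (pos, head, member, cond) arguments:

  struct tracker {
          void (*notify)(struct tracker *t);
          struct hlist_node node;
  };

  static HLIST_HEAD(trackers);
  DEFINE_STATIC_SRCU(trackers_srcu);

  static void notify_all(void)
  {
          struct tracker *t;
          int idx = srcu_read_lock(&trackers_srcu);

          /* With hlist_for_each_entry_rcu(t, &trackers, node,
           *          srcu_read_lock_held(&trackers_srcu))
           * lockdep could not flag a walk done under plain rcu_read_lock();
           * the SRCU variant also verifies that the right lock is held.
           */
          hlist_for_each_entry_srcu(t, &trackers, node,
                                    srcu_read_lock_held(&trackers_srcu))
                  t->notify(t);

          srcu_read_unlock(&trackers_srcu, idx);
  }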
2020-08-24  x86/fsgsbase: Replace static_cpu_has() with boot_cpu_has()  (Borislav Petkov, 2 files, -6/+6)
ptrace and prctl() are not really fast paths to warrant the use of static_cpu_has() and cause alternatives patching for no good reason. Replace with boot_cpu_has() which is simple and fast enough. No functional changes. Signed-off-by: Borislav Petkov <[email protected]> Link: https://lkml.kernel.org/r/[email protected]
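An illustration of the swap in a non-hot path, with an invented wrapper name; X86_FEATURE_FSGSBASE is the feature bit the affected files check:

  static long fsgsbase_prctl_check(void)
  {
          /* Was: if (!static_cpu_has(X86_FEATURE_FSGSBASE)) ...
           * boot_cpu_has() is a plain runtime test of the feature bit and
           * avoids alternatives patching, which is overkill for ptrace/prctl.
           */
          if (!boot_cpu_has(X86_FEATURE_FSGSBASE))
                  return -ENODEV;

          return 0;
  }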
2020-08-24  x86/entry/64: Correct the comment over SAVE_AND_SET_GSBASE  (Borislav Petkov, 1 file, -2/+3)
Add the proper explanation why an LFENCE is not needed in the FSGSBASE case. Fixes: c82965f9e530 ("x86/entry/64: Handle FSGSBASE enabled paranoid entry/exit") Signed-off-by: Borislav Petkov <[email protected]> Link: https://lkml.kernel.org/r/[email protected]
2020-08-23  treewide: Use fallthrough pseudo-keyword  (Gustavo A. R. Silva, 31 files, -57/+52)
Replace the existing /* fall through */ comments and their variants with the new pseudo-keyword macro fallthrough[1]. Also, remove unnecessary fall-through markings when it is the case. [1] https://www.kernel.org/doc/html/v5.7/process/deprecated.html?highlight=fallthrough#implicit-switch-case-fall-through Signed-off-by: Gustavo A. R. Silva <[email protected]>
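A sketch of the conversion on a made-up switch statement; fallthrough is the macro from include/linux/compiler_attributes.h, expanding to __attribute__((__fallthrough__)) where the compiler supports it:

  static void advance(enum state s)
  {
          switch (s) {
          case STATE_PREPARE:
                  prepare();
                  /* was: a "fall through" comment, which the compiler cannot verify */
                  fallthrough;
          case STATE_RUN:
                  run();
                  break;
          default:
                  break;
          }
  }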
2020-08-23  Merge tag 'x86-urgent-2020-08-23' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip  (Linus Torvalds, 1 file, -4/+6)
Pull x86 fix from Thomas Gleixner:
 "A single fix for x86 which removes the RDPID usage from the paranoid entry path and unconditionally uses LSL to retrieve the CPU number.

  RDPID depends on MSR_TSC_AUX. KVM has an optimization to avoid expensive MSR reads/writes on VMENTER/EXIT. It caches the MSR values and restores them either when leaving the run loop, on preemption or when going out to user space. MSR_TSC_AUX is part of that lazy MSR set, so after writing the guest value and before the lazy restore any exception using the paranoid entry will read the guest value and use it as CPU number to retrieve the GSBASE value for the current CPU when FSGSBASE is enabled. As RDPID is only used in that particular entry path, there is no reason to burden VMENTER/EXIT with two extra MSR writes. Remove the RDPID optimization, which is not even backed by numbers, from the paranoid entry path instead"

* tag 'x86-urgent-2020-08-23' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/entry/64: Do not use RDPID in paranoid entry to accomodate KVM
2020-08-23  Merge tag 'perf-urgent-2020-08-23' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip  (Linus Torvalds, 1 file, -3/+49)
Pull x86 perf fix from Thomas Gleixner:
 "A single update for perf on x86 which has support for the broken-down bandwidth counters"

* tag 'perf-urgent-2020-08-23' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  perf/x86/intel/uncore: Add BW counters for GT, IA and IO breakdown
2020-08-23  Merge tag 'efi-urgent-2020-08-23' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip  (Linus Torvalds, 4 files, -86/+39)
Pull EFI fixes from Thomas Gleixner:

 - Enforce NX on RO data in mixed EFI mode

 - Destroy workqueue in an error handling path to prevent UAF

 - Stop argument parser at '--' which is the delimiter for init

 - Treat a NULL command line pointer as empty instead of dereferencing it unconditionally

 - Handle an unterminated command line correctly

 - Cleanup the 32-bit code leftovers and remove obsolete documentation

* tag 'efi-urgent-2020-08-23' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  Documentation: efi: remove description of efi=old_map
  efi/x86: Move 32-bit code into efi_32.c
  efi/libstub: Handle unterminated cmdline
  efi/libstub: Handle NULL cmdline
  efi/libstub: Stop parsing arguments at "--"
  efi: add missed destroy_workqueue when efisubsys_init fails
  efi/x86: Mark kernel rodata non-executable for mixed mode
2020-08-22  Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm  (Linus Torvalds, 3 files, -4/+8)
Pull kvm fixes from Paolo Bonzini:

 - PAE and PKU bugfixes for x86

 - selftests fix for new binutils

 - MMU notifier fix for arm64

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
  KVM: arm64: Only reschedule if MMU_NOTIFIER_RANGE_BLOCKABLE is not set
  KVM: Pass MMU notifier range flags to kvm_unmap_hva_range()
  kvm: x86: Toggling CR4.PKE does not load PDPTEs in PAE mode
  kvm: x86: Toggling CR4.SMAP does not load PDPTEs in PAE mode
  KVM: x86: fix access code passed to gva_to_gpa
  selftests: kvm: Use a shorter encoding to clear RAX