path: root/arch/x86
Age    Commit message    Author    Files    Lines
2020-09-24perf/x86/intel/uncore: Factor out uncore_pci_find_dev_pmu()Kan Liang1-15/+33
When an uncore PCI sub driver gets a remove notification, the corresponding PMU has to be retrieved and unregistered. The code which finds the corresponding PMU by comparing against the pci_device_id table can be shared. Factor out uncore_pci_find_dev_pmu(), which will be used later. There is no functional change. Signed-off-by: Kan Liang <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Link: https://lkml.kernel.org/r/[email protected]
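A rough sketch of what such a shared lookup helper can look like; the body below is illustrative of the idea, not the kernel's verbatim code, and uncore_pci_pmu_from_data() is a hypothetical helper standing in for the driver_data-to-PMU mapping:

    /* Walk a pci_device_id table and return the PMU matching a PCI device
     * (illustrative shape of uncore_pci_find_dev_pmu()). */
    static struct intel_uncore_pmu *
    uncore_pci_find_dev_pmu(struct pci_dev *pdev, const struct pci_device_id *ids)
    {
            while (ids && ids->vendor) {
                    if (pdev->vendor == ids->vendor &&
                        pdev->device == ids->device)
                            /* hypothetical: map driver_data to the PMU */
                            return uncore_pci_pmu_from_data(ids->driver_data);
                    ids++;
            }
            return NULL;
    }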
2020-09-24perf/x86/intel/uncore: Factor out uncore_pci_get_dev_die_info()Kan Liang1-8/+23
The socket and die information is required to register/unregister a PMU in the uncore PCI sub driver. The code which derives the socket and die information from a bus number can be shared. Factor out uncore_pci_get_dev_die_info(), which will be used later. There is no functional change. Signed-off-by: Kan Liang <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Link: https://lkml.kernel.org/r/[email protected]
2020-09-24perf/amd/uncore: Inform the user how many counters each uncore PMU hasKim Phillips1-6/+7
Previously, the uncore driver would say "NB counters detected" on F17h machines, which don't have NorthBridge (NB) counters. They have Data Fabric (DF) counters. Just use the pmu.name to inform users which pmu to use and its associated counter count.

F17h dmesg BEFORE:

amd_uncore: AMD NB counters detected
amd_uncore: AMD LLC counters detected

F17h dmesg AFTER:

amd_uncore: 4 amd_df counters detected
amd_uncore: 6 amd_l3 counters detected

Signed-off-by: Kim Phillips <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Link: https://lkml.kernel.org/r/[email protected]
2020-09-24perf/amd/uncore: Allow F19h user coreid, threadmask, and sliceid specificationKim Phillips1-5/+32
On Family 19h, the driver checks for a populated 2-bit threadmask in order to establish that the user wants to measure individual slices or individual cores (only one can be measured at a time), and lets the user also directly specify enallcores and/or enallslices if desired. Example F19h invocation to measure L3 accesses (event 4, umask 0xff) by the first thread (id 0 -> mask 0x1) of the first core (id 0) on the first slice (id 0):

perf stat -a -e instructions,amd_l3/umask=0xff,event=0x4,coreid=0,threadmask=1,sliceid=0,enallcores=0,enallslices=0/ <workload>

Signed-off-by: Kim Phillips <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Link: https://lkml.kernel.org/r/[email protected]
2020-09-24perf/amd/uncore: Allow F17h user threadmask and slicemask specificationKim Phillips1-7/+16
Continue to fully populate either one of threadmask or slicemask if the user doesn't specify it. Signed-off-by: Kim Phillips <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Link: https://lkml.kernel.org/r/[email protected]
2020-09-24perf/amd/uncore: Prepare to scale for more attributes that vary per familyKim Phillips1-50/+61
Replace AMD_FORMAT_ATTR with the more apropos DEFINE_UNCORE_FORMAT_ATTR stolen from arch/x86/events/intel/uncore.h. This way we can clearly see the bit-variants of each of the attributes that want to have the same name across families. Also unroll AMD_ATTRIBUTE because we are going to separately add new attributes that differ between DF and L3. Also clean up the if-Family 17h-else logic in amd_uncore_init. This is basically a rewrite of commit da6adaea2b7e ("perf/x86/amd/uncore: Update sysfs attributes for Family17h processors"). No functional changes. Tested F17h+ /sys/bus/event_source/devices/amd_{l3,df}/format/* content remains unchanged:

/sys/bus/event_source/devices/amd_l3/format/event:config:0-7
/sys/bus/event_source/devices/amd_l3/format/umask:config:8-15
/sys/bus/event_source/devices/amd_df/format/event:config:0-7,32-35,59-60
/sys/bus/event_source/devices/amd_df/format/umask:config:8-15

Signed-off-by: Kim Phillips <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Link: https://lkml.kernel.org/r/[email protected]
2020-09-23x86/ioapic: Unbreak check_timer()Thomas Gleixner1-0/+1
Several people reported in the kernel bugzilla that between v4.12 and v4.13 the magic which works around broken hardware and BIOSes to find the proper timer interrupt delivery mode stopped working for some older affected platforms which need to fall back to ExtINT delivery mode. The reason is that the core code changed to keep track of the masked and disabled state of an interrupt line more accurately to avoid the expensive hardware operations. That broke an assumption in i8259_make_irq() which invokes:

disable_irq_nosync();
irq_set_chip_and_handler();
enable_irq();

Up to v4.12 this worked because enable_irq() unconditionally unmasked the interrupt line, but after the state tracking improvements this is no longer the case because the IO/APIC uses lazy disabling. So the line state is unmasked, which means that enable_irq() does not call into the new irq chip to unmask it. In principle this is a shortcoming of the core code, but it's more than unclear whether the core code should try to reset state. At least this cannot be done unconditionally as that would break other existing use cases where the chip type is changed, e.g. when changing the trigger type, but the callers expect the state to be preserved. As the way check_timer() switches the delivery modes is truly unique, the obvious fix is to simply unmask the i8259 manually after changing the mode to ExtINT delivery and switching the irq chip to the legacy PIC. Note that the Fixes tag is not really precise, but it identifies the commit which broke the assumptions in the IO/APIC and i8259 code, and that's the kernel version to which this needs to be backported. Fixes: bf22ff45bed6 ("genirq: Avoid unnecessary low level irq function calls") Reported-by: [email protected] Reported-by: [email protected] Reported-by: [email protected] Reported-by: [email protected] Reported-by: [email protected] Signed-off-by: Thomas Gleixner <[email protected]> Tested-by: [email protected] Tested-by: [email protected] Cc: [email protected] Link: https://bugzilla.kernel.org/show_bug.cgi?id=197769
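To make the failure mode concrete, here is the shape of the broken sequence and the kind of explicit unmask the changelog describes; this is a sketch of the idea, not the exact patch, and the final unmask call is illustrative:

    disable_irq_nosync(irq);            /* IO/APIC disables lazily: the recorded
                                         * line state stays "unmasked" */
    irq_set_chip_and_handler(irq, chip, handler); /* switch to the legacy PIC chip */
    enable_irq(irq);                    /* sees "unmasked" state, so it never calls
                                         * the new chip's unmask callback */

    /* hence check_timer() unmasks the i8259 line by hand after switching
     * to ExtINT delivery, conceptually: */
    legacy_pic->unmask(0);              /* illustrative */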
2020-09-23Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-nextDavid S. Miller2-69/+212
Alexei Starovoitov says:

====================
pull-request: bpf-next 2020-09-23

The following pull-request contains BPF updates for your *net-next* tree. We've added 95 non-merge commits during the last 22 day(s) which contain a total of 124 files changed, 4211 insertions(+), 2040 deletions(-).

The main changes are:

1) Full multi function support in libbpf, from Andrii.
2) Refactoring of function argument checks, from Lorenz.
3) Make bpf_tail_call compatible with functions (subprograms), from Maciej.
4) Program metadata support, from YiFei.
5) bpf iterator optimizations, from Yonghong.
====================

Signed-off-by: David S. Miller <[email protected]>
2020-09-23KVM: x86: VMX: Make smaller physical guest address space support user-configurableMohammed Gamal3-7/+15
This patch exposes allow_smaller_maxphyaddr to the user as a module parameter. Since smaller physical address spaces are only supported on VMX, the parameter is only exposed in the kvm_intel module. For now disable support by default, and let the user decide if they want to enable it. Modifications to VMX page fault and EPT violation handling will depend on whether that parameter is enabled. Signed-off-by: Mohammed Gamal <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
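As a sketch of how such a knob is exposed (the variable name is from the text above; the exact declaration and permission flags are assumptions):

    /* kvm_intel module parameter; readable back via
     * /sys/module/kvm_intel/parameters/allow_smaller_maxphyaddr */
    bool __read_mostly allow_smaller_maxphyaddr;
    module_param(allow_smaller_maxphyaddr, bool, S_IRUGO);

Loading with "modprobe kvm_intel allow_smaller_maxphyaddr=1" would then opt in at runtime.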
2020-09-22fs: remove compat_sys_mountChristoph Hellwig1-1/+1
compat_sys_mount is identical to the regular sys_mount now, so remove it and use the native version everywhere. Signed-off-by: Christoph Hellwig <[email protected]> Signed-off-by: Al Viro <[email protected]>
2020-09-22Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netDavid S. Miller23-82/+230
Two minor conflicts:

1) net/ipv4/route.c, adding a new local variable while moving another local variable and removing its initial assignment.

2) drivers/net/dsa/microchip/ksz9477.c, overlapping changes. One pretty prints the port mode differently, whilst another changes the driver to try and obtain the port mode from the port node rather than the switch node.

Signed-off-by: David S. Miller <[email protected]>
2020-09-22x86/irq: Make run_on_irqstack_cond() typesafeThomas Gleixner6-12/+67
Sami reported that run_on_irqstack_cond() requires the caller to cast functions to mismatching types, which trips indirect call Control-Flow Integrity (CFI) in Clang. Instead of disabling CFI on that function, provide proper helpers for the three call variants. The actual ASM code stays the same as that is out of reach. [ bp: Fix __run_on_irqstack() prototype to match. ] Fixes: 931b94145981 ("x86/entry: Provide helpers for executing on the irqstack") Reported-by: Nathan Chancellor <[email protected]> Reported-by: Sami Tolvanen <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Signed-off-by: Borislav Petkov <[email protected]> Tested-by: Sami Tolvanen <[email protected]> Cc: <[email protected]> Link: https://github.com/ClangBuiltLinux/linux/issues/1052 Link: https://lkml.kernel.org/r/[email protected]
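A plausible shape for the typed helpers (names modeled on the commit's description; treat the exact signatures as assumptions):

    /* one helper per call signature instead of casting everything through a
     * generic void (*)(void *), which is what tripped Clang's indirect-call CFI */
    void run_on_irqstack_cond(void (*func)(void), struct pt_regs *regs);
    void run_sysvec_on_irqstack_cond(void (*func)(struct pt_regs *regs),
                                     struct pt_regs *regs);
    void run_irq_on_irqstack_cond(void (*func)(struct irq_desc *desc),
                                  struct irq_desc *desc,
                                  struct pt_regs *regs);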
2020-09-22Merge branch 'x86-seves-for-paolo' of https://git.kernel.org/pub/scm/linux/kernel/git/tip/tip into HEADPaolo Bonzini62-428/+623
2020-09-22x86/fpu: Handle FPU-related and clearcpuid command line arguments earlierMike Hommey2-55/+55
The FPU-related and clearcpuid= command line arguments are currently handled during FPU initialization. However, in the case of clearcpuid=, some other early initialization code may check for features before the FPU initialization code is called. Handling the arguments earlier allows the command line to influence those early initializations. Signed-off-by: Mike Hommey <[email protected]> Signed-off-by: Borislav Petkov <[email protected]> Link: https://lkml.kernel.org/r/[email protected]
2020-09-21Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvmLinus Torvalds1-19/+3
Pull kvm fixes from Paolo Bonzini:

"ARM:
- fix fault on page table writes during instruction fetch

s390:
- doc improvement

x86:
- The obvious patches are always the ones that turn out to be completely broken. /me hangs his head in shame"

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
Revert "KVM: Check the allocation of pv cpu mask"
KVM: arm64: Remove S1PTW check from kvm_vcpu_dabt_iswrite()
KVM: arm64: Assume write fault on S1PTW permission fault on instruction fetch
docs: kvm: add documentation for KVM_CAP_S390_DIAG318
2020-09-20Merge tag 'x86_urgent_for_v5.9_rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tipLinus Torvalds4-1/+24
Pull x86 fixes from Borislav Petkov:

- A defconfig fix (Daniel Díaz)

- Disable relocation relaxation for the compressed kernel when not built as -pie as in that case kernels built with clang and linked with LLD fail to boot due to the linker optimizing some instructions in non-PIE form; the gory details in the commit message (Arvind Sankar)

- A fix for the "bad bp value" warning issued by the frame-pointer unwinder (Josh Poimboeuf)

* tag 'x86_urgent_for_v5.9_rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/unwind/fp: Fix FP unwinding in ret_from_fork
x86/boot/compressed: Disable relocation relaxation
x86/defconfigs: Explicitly unset CONFIG_64BIT in i386_defconfig
2020-09-20Revert "KVM: Check the allocation of pv cpu mask"Vitaly Kuznetsov1-19/+3
Commit 0f990222108d ("KVM: Check the allocation of pv cpu mask"), which we have in 5.9-rc5, has two issues:

1) Compilation fails for !CONFIG_SMP, see: https://bugzilla.kernel.org/show_bug.cgi?id=209285

2) This commit completely disables PV TLB flush, see https://lore.kernel.org/kvm/[email protected]/

The allocation problem is likely a theoretical one; if we don't have memory that early in the boot process we're likely doomed anyway. Let's solve it properly later. This reverts commit 0f990222108d214a0924d920e6095b58107d7b59. Signed-off-by: Vitaly Kuznetsov <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2020-09-19KVM: SVM: Don't flush cache if hardware enforces cache coherency across encryption domainsKrish Sadhukhan1-1/+2
In some hardware implementations, coherency between the encrypted and unencrypted mappings of the same physical page in a VM is enforced. In such a system, it is not required for software to flush the VM's page from all CPU caches in the system prior to changing the value of the C-bit for the page. So check that bit before flushing the cache. Signed-off-by: Krish Sadhukhan <[email protected]> Signed-off-by: Borislav Petkov <[email protected]> Acked-by: Paolo Bonzini <[email protected]> Link: https://lkml.kernel.org/r/[email protected]
2020-09-18Merge tag 'pm-5.9-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pmLinus Torvalds1-2/+0
Pull power management updates from Rafael Wysocki:
"These add a new CPU ID to the RAPL power capping driver and prevent the ACPI processor idle driver from triggering RCU-lockdep complaints.

Specifics:

- Add support for the Lakefield chip to the RAPL power capping driver (Ricardo Neri).

- Modify the ACPI processor idle driver to prevent it from triggering RCU-lockdep complaints which have started to happen after recent changes in that area (Peter Zijlstra)"

* tag 'pm-5.9-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
ACPI: processor: Take over RCU-idle for C3-BM idle
cpuidle: Allow cpuidle drivers to take over RCU-idle
ACPI: processor: Use CPUIDLE_FLAG_TLB_FLUSHED
ACPI: processor: Use CPUIDLE_FLAG_TIMER_STOP
powercap: RAPL: Add support for Lakefield
2020-09-18stacktrace: Remove reliable argument from arch_stack_walk() callbackMark Brown1-5/+5
Currently the callback passed to arch_stack_walk() has an argument called reliable passed to it to indicate if the stack entry is reliable; a comment says that this is used by some printk() consumers. However, in the current kernel none of the arch_stack_walk() implementations ever set this flag to true and the only callback implementation we have is in the generic stacktrace code, which ignores the flag. This flag is therefore redundant, so we can simplify and clarify things by removing it. Signed-off-by: Mark Brown <[email protected]> Reviewed-by: Miroslav Benes <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Will Deacon <[email protected]>
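In terms of the callback type, the simplification amounts to dropping one parameter; a sketch (the typedef name matches the generic stacktrace code, the exact spelling of the old variant is an assumption):

    /* before: every implementation passed reliable == false anyway */
    bool (*consume_entry)(void *cookie, unsigned long addr, bool reliable);

    /* after: */
    typedef bool (*stack_trace_consume_fn)(void *cookie, unsigned long addr);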
2020-09-18x86/mce: Annotate mce_rd/wrmsrl() with noinstrBorislav Petkov1-6/+21
They do get called from the #MC handler which is already marked "noinstr". Commit e2def7d49d08 ("x86/mce: Make mce_rdmsrl() panic on an inaccessible MSR") already got rid of the instrumentation in the MSR accessors, fix the annotation now too, in order to get rid of:

vmlinux.o: warning: objtool: do_machine_check()+0x4a: call to mce_rdmsrl() leaves .noinstr.text section

Signed-off-by: Borislav Petkov <[email protected]> Link: https://lkml.kernel.org/r/[email protected]
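The annotation change itself is small; schematically (function bodies elided, shapes assumed from the names in the subject):

    /* noinstr keeps these out of instrumentable sections so calls from
     * the #MC handler don't leave .noinstr.text */
    static noinstr u64 mce_rdmsrl(u32 msr) { /* ... */ }
    static noinstr void mce_wrmsrl(u32 msr, u64 v) { /* ... */ }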
2020-09-18x86/mm/pat: Don't flush cache if hardware enforces cache coherency across encryption domainsKrish Sadhukhan1-1/+1
In some hardware implementations, coherency between the encrypted and unencrypted mappings of the same physical page is enforced. In such a system, it is not required for software to flush the page from all CPU caches in the system prior to changing the value of the C-bit for the page. So check that bit before flushing the cache. [ bp: Massage commit message. ] Suggested-by: Tom Lendacky <[email protected]> Signed-off-by: Krish Sadhukhan <[email protected]> Signed-off-by: Borislav Petkov <[email protected]> Link: https://lkml.kernel.org/r/[email protected]
2020-09-18x86/cpu: Add hardware-enforced cache coherency as a CPUID featureKrish Sadhukhan2-1/+2
In some hardware implementations, coherency between the encrypted and unencrypted mappings of the same physical page is enforced. In such a system, it is not required for software to flush the page from all CPU caches in the system prior to changing the value of the C-bit for a page. This hardware-enforced cache coherency is indicated by EAX[10] in CPUID leaf 0x8000001f. [ bp: Use one of the free slots in word 3. ] Suggested-by: Tom Lendacky <[email protected]> Signed-off-by: Krish Sadhukhan <[email protected]> Signed-off-by: Borislav Petkov <[email protected]> Link: https://lkml.kernel.org/r/[email protected]
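Since the leaf and bit position are spelled out above, a userspace check is straightforward; a minimal sketch using GCC's cpuid.h (the SME_COHERENT naming is an assumption):

    #include <cpuid.h>
    #include <stdbool.h>

    /* EAX[10] of CPUID leaf 0x8000001f: hardware-enforced cache coherency
     * across encryption domains */
    static bool sme_coherent(void)
    {
            unsigned int eax, ebx, ecx, edx;

            if (!__get_cpuid(0x8000001f, &eax, &ebx, &ecx, &edx))
                    return false;
            return eax & (1u << 10);
    }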
2020-09-18x86/unwind/fp: Fix FP unwinding in ret_from_forkJosh Poimboeuf2-1/+21
There have been some reports of "bad bp value" warnings printed by the frame pointer unwinder:

WARNING: kernel stack regs at 000000005bac7112 in sh:1014 has bad 'bp' value 0000000000000000

This warning happens when unwinding from an interrupt in ret_from_fork(). If entry code gets interrupted, the state of the frame pointer (rbp) may be undefined, which can confuse the unwinder, resulting in warnings like the above. There's an in_entry_code() check which normally silences such warnings for entry code. But in this case, ret_from_fork() is getting interrupted. It recently got moved out of .entry.text, so the in_entry_code() check no longer works. It could be moved back into .entry.text, but that would break the noinstr validation because of the call to schedule_tail(). Instead, initialize each new task's RBP to point to the task's entry regs via an encoded frame pointer. That will allow the unwinder to reach the end of the stack gracefully. Fixes: b9f6976bfb94 ("x86/entry/64: Move non entry code into .text section") Reported-by: Naresh Kamboju <[email protected]> Reported-by: Logan Gunthorpe <[email protected]> Signed-off-by: Josh Poimboeuf <[email protected]> Signed-off-by: Borislav Petkov <[email protected]> Acked-by: Peter Zijlstra (Intel) <[email protected]> Acked-by: Thomas Gleixner <[email protected]> Link: https://lkml.kernel.org/r/f366bbf5a8d02e2318ee312f738112d0af74d16f.1600103007.git.jpoimboe@redhat.com
2020-09-17bpf, x64: rework pro/epilogue and tailcall handling in JITMaciej Fijalkowski1-48/+189
This commit serves two things:

1) it optimizes BPF prologue/epilogue generation
2) it makes it possible to have tailcalls within a BPF subprogram

Both points are related to each other, since without 1), 2) could not be achieved. In [1], Alexei says:

"The prologue will look like:

nop5
xor eax,eax  // two new bytes if bpf_tail_call() is used in this function
push rbp
mov rbp, rsp
sub rsp, rounded_stack_depth
push rax // zero init tail_call counter
variable number of push rbx,r13,r14,r15

Then bpf_tail_call will pop variable number rbx,.. and final 'pop rax'. Then 'add rsp, size_of_current_stack_frame', jmp to next function and skip over 'nop5; xor eax,eax; push rpb; mov rbp, rsp'. This way new function will set its own stack size and will init tail call counter with whatever value the parent had. If next function doesn't use bpf_tail_call it won't have 'xor eax,eax'. Instead it would need to have 'nop2' in there."

Implement that suggestion. Since the layout of the stack is changed, tail call counter handling can no longer rely on popping it to rbx just like it has been handled for the constant prologue case and the later overwrite of rbx with the actual value of rbx pushed to the stack. Therefore, let's use one of the registers (%rcx) that is considered to be volatile/caller-saved and pop the value of the tail call counter into it in the epilogue.

Drop the BUILD_BUG_ON in emit_prologue and in emit_bpf_tail_call_indirect, where the instruction layout is not constant anymore. Introduce a new poke target, 'tailcall_bypass', to the poke descriptor; it is dedicated to skipping the register pops and stack unwind that are generated right before the actual jump to the target program. For the case when the target program is not present, the BPF program will skip the pop instructions and the nop5 dedicated for jmpq $target. An example of such a state, when only R6 of the callee saved registers is used by the program:

ffffffffc0513aa1: e9 0e 00 00 00       jmpq   0xffffffffc0513ab4
ffffffffc0513aa6: 5b                   pop    %rbx
ffffffffc0513aa7: 58                   pop    %rax
ffffffffc0513aa8: 48 81 c4 00 00 00 00 add    $0x0,%rsp
ffffffffc0513aaf: 0f 1f 44 00 00       nopl   0x0(%rax,%rax,1)
ffffffffc0513ab4: 48 89 df             mov    %rbx,%rdi

When the target program is inserted, the jump that was there to skip pops/nop5 will become the nop5, so the CPU will go over the pops and do the actual tailcall. One might ask why there simply can not be pushes after the nop5. In the following example snippet:

ffffffffc037030c: 48 89 fb             mov    %rdi,%rbx
(...)
ffffffffc0370332: 5b                   pop    %rbx
ffffffffc0370333: 58                   pop    %rax
ffffffffc0370334: 48 81 c4 00 00 00 00 add    $0x0,%rsp
ffffffffc037033b: 0f 1f 44 00 00       nopl   0x0(%rax,%rax,1)
ffffffffc0370340: 48 81 ec 00 00 00 00 sub    $0x0,%rsp
ffffffffc0370347: 50                   push   %rax
ffffffffc0370348: 53                   push   %rbx
ffffffffc0370349: 48 89 df             mov    %rbx,%rdi
ffffffffc037034c: e8 f7 21 00 00       callq  0xffffffffc0372548

there is the bpf2bpf call (at ffffffffc037034c) right after the tailcall and the jump target is not present. ctx is in the %rbx register and the BPF subprogram that we will call into on ffffffffc037034c relies on it, e.g. it will pick ctx from there. Such a code layout is therefore broken as we would overwrite the content of %rbx with the value that was pushed in the prologue. That is the reason for the 'bypass' approach.

Special care needs to be taken during the install/update/remove of the tailcall target. In the case when the target program is not present, the CPU must not execute the pop instructions that precede the tailcall. To address that, the following states can be defined:

A  nop, unwind, nop
B  nop, unwind, tail
C  skip, unwind, nop
D  skip, unwind, tail

A is forbidden (it leads to incorrectness). The state transitions between tailcall install/update/remove will work as follows:

First install tail call f: C->D->B(f)
 * poke the tailcall, after that get rid of the skip

Update tail call f to f': B(f)->B(f')
 * poke the tailcall (poke->tailcall_target) and do NOT touch the poke->tailcall_bypass

Remove tail call: B(f')->C(f')
 * poke->tailcall_bypass is poked back to jump, then we wait for the RCU grace period so that other programs will finish their execution and after that we are safe to remove the poke->tailcall_target

Install new tail call (f''): C(f')->D(f'')->B(f'')
 * same as the first step

This way the CPU can never be exposed to the "unwind, tail" state.

Last but not least, when tailcalls get mixed with bpf2bpf calls, it would be possible to encounter an endless loop due to clearing the tailcall counter if, for example, we would use a tailcall3-like program from the BPF selftests that is subprogram-based, meaning the tailcall would be present within the BPF subprogram. This test, broken down to particular steps, would do:

entry -> set tailcall counter to 0, bump it by 1, tailcall to func0
func0 -> call subprog_tail (we are NOT skipping the first 11 bytes of the prologue and this subprogram has a tailcall, therefore we clear the counter...)
subprog -> do the same thing as entry

and then loop forever.

To address this, the idea is to go through the call chain of bpf2bpf progs and look for a tailcall presence throughout the whole chain. If we see a single tail call then each node in this call chain needs to be marked as a subprog that can reach the tailcall. We would later feed the JIT with this info and:

- set eax to 0 only when the tailcall is reachable and this is the entry prog
- if the tailcall is reachable but there's no tailcall in the insns of the currently JITed prog then push rax anyway, so that it will be possible to propagate it further down the call chain
- finally, if the tailcall is reachable, then we need to precede the 'call' insn with mov rax, [rbp - (stack_depth + 8)]

Tail call related cases from the test_verifier kselftest are also working fine. Sample BPF programs that utilize tail calls (sockex3, tracex5) work properly as well.

[1]: https://lore.kernel.org/bpf/[email protected]/

Suggested-by: Alexei Starovoitov <[email protected]> Signed-off-by: Maciej Fijalkowski <[email protected]> Signed-off-by: Alexei Starovoitov <[email protected]>
2020-09-17bpf: rename poke descriptor's 'ip' member to 'tailcall_target'Maciej Fijalkowski1-9/+11
Reflect the actual purpose of poke->ip and rename it to poke->tailcall_target so that it will not be confused with another poke target that will be introduced in the next commit. While at it, do the same thing with poke->ip_stable - rename it to poke->tailcall_target_stable. Signed-off-by: Maciej Fijalkowski <[email protected]> Signed-off-by: Alexei Starovoitov <[email protected]>
2020-09-17bpf, x64: use %rcx instead of %rax for tail call retpolinesMaciej Fijalkowski2-18/+18
Currently, %rax is used to store the jump target when a BPF program is emitting the retpoline instructions that handle the indirect tailcall. There is a plan to use %rax for a different purpose: storing the tail call counter. In order to preserve this value across the tailcalls, adjust the BPF indirect tailcalls so that the target program will reside in %rcx, and teach the retpoline instructions about the new location of the jump target. Signed-off-by: Maciej Fijalkowski <[email protected]> Signed-off-by: Alexei Starovoitov <[email protected]>
2020-09-17x86/mmu: Allocate/free a PASIDFenghua Yu3-0/+76
A PASID is allocated for an "mm" the first time any thread binds to an SVA-capable device and is freed from the "mm" when the SVA is unbound by the last thread. It's possible for the "mm" to have different PASID values in different binding/unbinding SVA cycles. The mm's PASID (non-zero for a valid PASID or 0 for an invalid PASID) is propagated to a per-thread PASID MSR for all threads within the mm through IPI, context switch, or inheritance. This is done to ensure that a running thread has the right PASID in the MSR matching the mm's PASID. [ bp: s/SVM/SVA/g; massage. ] Suggested-by: Andy Lutomirski <[email protected]> Signed-off-by: Fenghua Yu <[email protected]> Signed-off-by: Borislav Petkov <[email protected]> Reviewed-by: Tony Luck <[email protected]> Link: https://lkml.kernel.org/r/[email protected]
2020-09-17x86/cpufeatures: Mark ENQCMD as disabled when configured outFenghua Yu1-1/+8
Currently, the ENQCMD feature depends on CONFIG_IOMMU_SUPPORT. Add X86_FEATURE_ENQCMD to the disabled features mask so that it gets disabled when CONFIG_IOMMU_SUPPORT is not enabled. Signed-off-by: Fenghua Yu <[email protected]> Signed-off-by: Borislav Petkov <[email protected]> Reviewed-by: Tony Luck <[email protected]> Link: https://lkml.kernel.org/r/[email protected]
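The disabled-features mechanism is a compile-time mask; the pattern looks roughly like this (a sketch of the arch/x86/include/asm/disabled-features.h idiom, exact slot arithmetic assumed):

    #ifdef CONFIG_IOMMU_SUPPORT
    # define DISABLE_ENQCMD 0
    #else
    # define DISABLE_ENQCMD (1 << (X86_FEATURE_ENQCMD & 31))
    #endif

cpu_feature_enabled(X86_FEATURE_ENQCMD) then folds to a constant false when IOMMU support is configured out.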
2020-09-17x86/msr-index: Define an IA32_PASID MSRFenghua Yu1-0/+3
The IA32_PASID MSR (0xd93) contains the Process Address Space Identifier (PASID), a 20-bit value. Bit 31 must be set to indicate the value programmed in the MSR is valid. Hardware uses the PASID to identify a process address space and direct responses to the right address space. Signed-off-by: Fenghua Yu <[email protected]> Signed-off-by: Borislav Petkov <[email protected]> Reviewed-by: Tony Luck <[email protected]> Link: https://lkml.kernel.org/r/[email protected]
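Translated into the usual msr-index.h idiom, the definition described above would look roughly like this (macro names are assumptions modeled on the text; pasid is a placeholder variable):

    #define MSR_IA32_PASID       0x00000d93
    #define MSR_IA32_PASID_VALID BIT_ULL(31)

    /* program a 20-bit PASID and mark it valid (illustrative) */
    wrmsrl(MSR_IA32_PASID, (pasid & GENMASK_ULL(19, 0)) | MSR_IA32_PASID_VALID);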
2020-09-17x86/fpu/xstate: Add supervisor PASID state for ENQCMDYu-cheng Yu3-3/+16
The ENQCMD instruction reads a PASID from the IA32_PASID MSR. The MSR is stored in the task's supervisor XSAVE* PASID state and is context-switched by XSAVES/XRSTORS. [ bp: Add (in-)definite articles and massage. ] Signed-off-by: Yu-cheng Yu <[email protected]> Co-developed-by: Fenghua Yu <[email protected]> Signed-off-by: Fenghua Yu <[email protected]> Signed-off-by: Borislav Petkov <[email protected]> Reviewed-by: Tony Luck <[email protected]> Link: https://lkml.kernel.org/r/[email protected]
2020-09-17x86/cpufeatures: Enumerate ENQCMD and ENQCMDS instructionsFenghua Yu2-0/+2
The work submission instruction comes in two flavors. ENQCMD can be called both in ring 3 and ring 0 and always uses the contents of a PASID MSR when shipping the command to the device. ENQCMDS allows a kernel driver to submit commands on behalf of a user process. The driver supplies the PASID value in ENQCMDS. There isn't any usage of ENQCMD in the kernel as of now. The CPU feature flag is shown as "enqcmd" in /proc/cpuinfo. Signed-off-by: Fenghua Yu <[email protected]> Signed-off-by: Borislav Petkov <[email protected]> Reviewed-by: Tony Luck <[email protected]> Link: https://lkml.kernel.org/r/[email protected]
2020-09-17quota: simplify the quotactl compat handlingChristoph Hellwig1-1/+1
Fold the misaligned u64 workarounds into the main quotactl flow instead of implementing a separate compat syscall handler. Signed-off-by: Christoph Hellwig <[email protected]> Acked-by: Jan Kara <[email protected]> Signed-off-by: Al Viro <[email protected]>
2020-09-17compat: add a compat_need_64bit_alignment_fixup() helperChristoph Hellwig1-0/+1
Add a helper to check if the calling syscall needs a fixup for non-natural 64-bit type alignment in the compat ABI. This will only return true for i386 syscalls on x86_64. Signed-off-by: Christoph Hellwig <[email protected]> Signed-off-by: Al Viro <[email protected]>
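The alignment problem the helper gates on can be shown with a struct; this is an illustrative example, not kernel code:

    struct example {
            u32 flags;
            u64 value;      /* offset 4 under the i386 ABI, offset 8 natively:
                             * i386 aligns u64 to 4 bytes, x86_64 to 8 */
    };

    /* in a syscall shared between native and compat paths: */
    if (compat_need_64bit_alignment_fixup()) {
            /* caller is an i386 task on x86_64: copy the 64-bit
             * field as two 32-bit halves at 4-byte alignment */
    }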
2020-09-17compat: lift compat_s64 and compat_u64 to <asm-generic/compat.h>Christoph Hellwig1-2/+0
Lift the compat_s64 and compat_u64 definitions into common code, using the COMPAT_FOR_U64_ALIGNMENT symbol for the x86 special case. Signed-off-by: Christoph Hellwig <[email protected]> Signed-off-by: Al Viro <[email protected]>
2020-09-17dma-mapping: introduce DMA range map, supplanting dma_pfn_offsetJim Quinlan1-2/+4
The new field 'dma_range_map' in struct device is used to facilitate the use of single or multiple offsets between mapping regions of cpu addrs and dma addrs. It subsumes the role of "dev->dma_pfn_offset" which was only capable of holding a single uniform offset and had no region bounds checking. The function of_dma_get_range() has been modified so that it takes a single argument -- the device node -- and returns a map, NULL, or an error code. The map is an array that holds the information regarding the DMA regions. Each range entry contains the address offset, the cpu_start address, the dma_start address, and the size of the region. of_dma_configure() is the typical manner to set range offsets but there are a number of ad hoc assignments to "dev->dma_pfn_offset" in the kernel driver code. These cases now invoke the function dma_direct_set_offset(dev, cpu_addr, dma_addr, size). Signed-off-by: Jim Quinlan <[email protected]> [hch: various interface cleanups] Signed-off-by: Christoph Hellwig <[email protected]> Reviewed-by: Mathieu Poirier <[email protected]> Tested-by: Mathieu Poirier <[email protected]> Tested-by: Nathan Chancellor <[email protected]>
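For the ad hoc driver-side assignments mentioned above, usage reduces to one call; the addresses here are made up for illustration:

    /* RAM at CPU address 0x80000000 appears at DMA address 0 on this bus,
     * for a 1 GiB window */
    err = dma_direct_set_offset(dev, 0x80000000, 0x00000000, SZ_1G);
    if (err)
            return err;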
2020-09-16ACPI: processor: Use CPUIDLE_FLAG_TLB_FLUSHEDPeter Zijlstra1-2/+0
Make acpi_processor_idle() use the generic TLB flushing code. This again removes RCU usage after rcu_idle_enter(). (XXX make every C3 invalidate TLBs, not just C3-BM) Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Tested-by: Borislav Petkov <[email protected]> Signed-off-by: Rafael J. Wysocki <[email protected]>
2020-09-16efi: Support for MOK variable config tableLenny Szubowicz2-0/+4
Because of system-specific EFI firmware limitations, EFI volatile variables may not be capable of holding the required contents of the Machine Owner Key (MOK) certificate store when the certificate list grows above some size. Therefore, an EFI boot loader may pass the MOK certs via an EFI configuration table created specifically for this purpose to avoid this firmware limitation. An EFI configuration table is a much more primitive mechanism compared to EFI variables and is well suited for one-way passage of static information from a pre-OS environment to the kernel. This patch adds initial kernel support to recognize, parse, and validate the EFI MOK configuration table, where named entries contain the same data that would otherwise be provided in similarly named EFI variables. Additionally, this patch creates a sysfs binary file for each EFI MOK configuration table entry found. These files are read-only to root and are provided for use by user space utilities such as mokutil. A subsequent patch will load MOK certs into the trusted platform key ring using this infrastructure. Signed-off-by: Lenny Szubowicz <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Ard Biesheuvel <[email protected]>
2020-09-16x86/irq: Make most MSI ops XEN privateThomas Gleixner2-9/+14
Nothing except XEN uses the setup/teardown ops. Hide them there. Signed-off-by: Thomas Gleixner <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2020-09-16x86/irq: Cleanup the arch_*_msi_irqs() leftoversThomas Gleixner6-60/+0
Get rid of all the gunk and remove the 'select PCI_MSI_ARCH_FALLBACK' from the x86 Kconfig so the weak functions in the PCI core are replaced by stubs which emit a warning, which ensures that any failure to set the irq domain pointer results in a warning when the device is used. Signed-off-by: Thomas Gleixner <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2020-09-16PCI/MSI: Make arch_.*_msi_irq[s] fallbacks selectableThomas Gleixner1-0/+1
The arch_.*_msi_irq[s] fallbacks are compiled in whether an architecture requires them or not. Architectures which are fully utilizing hierarchical irq domains should never call into that code. It's not only architectures which depend on that by implementing one or more of the weak functions; there is also a bunch of drivers which rely on the weak functions which invoke msi_controller::setup_irq[s] and msi_controller::teardown_irq. Make the architectures and drivers which rely on them select them in Kconfig and, if not selected, replace them by stub functions which emit a warning and fail the PCI/MSI interrupt allocation. Signed-off-by: Thomas Gleixner <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2020-09-16x86/pci: Set default irq domain in pcibios_add_device()Thomas Gleixner3-2/+20
Now that interrupt remapping sets the irqdomain pointer when a PCI device is added, it's possible to store the default irq domain in the device struct in pcibios_add_device(). If the bus to which a device is connected has an irq domain associated, then this domain is used; otherwise the default domain (PCI/MSI native or XEN PCI/MSI) is used. Using the bus domain ensures that special MSI bus domains like VMD work. This makes XEN and the non-remapped native case work solely based on the irq domain pointer in struct device for PCI/MSI and allows removing the arch fallback and making most of the x86_msi ops private to XEN in the next steps. Signed-off-by: Thomas Gleixner <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2020-09-16x86/xen: Wrap XEN MSI management into irqdomainThomas Gleixner1-0/+63
To allow utilizing the irq domain pointer in struct device it is necessary to make XEN/MSI irq domain compatible. While the right solution would be to truly convert XEN to irq domains, this is an exercise which is not possible for mere mortals with limited XENology. Provide a plain irqdomain wrapper around XEN. While this is a blatant violation of the irqdomain design, it's the only solution for a XEN-ignorant person to make progress on the issue which triggered this change. Signed-off-by: Thomas Gleixner <[email protected]> Acked-by: Juergen Gross <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2020-09-16x86/xen: Consolidate XEN-MSI initThomas Gleixner1-19/+32
X86 cannot store the irq domain pointer in struct device without breaking XEN because the irq domain pointer takes precedence over the arch_*_msi_irqs() fallbacks. To achieve this, XEN MSI interrupt management needs to be wrapped into an irq domain. Move the x86_msi ops setup into a single function to prepare for this. Signed-off-by: Thomas Gleixner <[email protected]> Reviewed-by: Juergen Gross <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2020-09-16x86/xen: Rework MSI teardownThomas Gleixner1-5/+18
X86 cannot store the irq domain pointer in struct device without breaking XEN because the irq domain pointer takes precedence over the arch_*_msi_irqs() fallbacks. XEN's MSI teardown relies on default_teardown_msi_irqs() which invokes arch_teardown_msi_irq(). default_teardown_msi_irqs() is a trivial iterator over the msi entries associated with a device. Implement this loop in xen_teardown_msi_irqs() to prepare for the removal of the fallbacks for X86. This is a preparatory step to wrap XEN MSI alloc/free into an irq domain, which in turn allows storing the irq domain pointer in struct device and using the irq domain functions directly. Signed-off-by: Thomas Gleixner <[email protected]> Reviewed-by: Juergen Gross <[email protected]> Link: https://lore.kernel.org/r/[email protected]
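The iteration the text describes is short; roughly (a sketch of the reimplemented loop, not the verbatim patch):

    static void xen_teardown_msi_irqs(struct pci_dev *dev)
    {
            struct msi_desc *msidesc;
            int i;

            /* what default_teardown_msi_irqs() used to do for us */
            for_each_pci_msi_entry(msidesc, dev)
                    for (i = 0; i < msidesc->nvec_used; i++)
                            arch_teardown_msi_irq(msidesc->irq + i);
    }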
2020-09-16x86/xen: Make xen_msi_init() static and rename it to xen_hvm_msi_init()Thomas Gleixner1-2/+2
The only user is in the same file and the name is too generic because this function is only ever used for HVM domains. Signed-off-by: Thomas Gleixner <[email protected]> Reviewed-by: Juergen Gross <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2020-09-16x86/irq: Initialize PCI/MSI domain at PCI init timeThomas Gleixner6-17/+32
No point in initializing the default PCI/MSI interrupt domain early and no point in creating it when XEN PV/HVM/DOM0 are active. Move the initialization to pci_arch_init() and convert it to init ops so that XEN can override it, as XEN has its own PCI/MSI management. The XEN override comes in a later step. Signed-off-by: Thomas Gleixner <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2020-09-16x86/pci: Reduce #ifdeffery in PCI init codeThomas Gleixner2-7/+14
Adding a function call before the first #ifdef in arch_pci_init() triggers a 'mixed declarations and code' warning if PCI_DIRECT is enabled. Use stub functions and move the #ifdeffery to the header file where it is not in the way. Signed-off-by: Thomas Gleixner <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2020-09-16x86/irq: Move apic_post_init() invocation to one placeThomas Gleixner3-6/+3
No point in calling it from both the 32bit and 64bit implementations of default_setup_apic_routing(). Move it to the caller. Signed-off-by: Thomas Gleixner <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2020-09-16x86/msi: Use generic MSI domain opsThomas Gleixner2-31/+1
pci_msi_get_hwirq() and pci_msi_set_desc() are no longer special. Enable the generic MSI domain ops in the core and PCI MSI code unconditionally and get rid of the x86-specific implementations in the x86 MSI code and in the hyperv PCI driver. Signed-off-by: Thomas Gleixner <[email protected]> Reviewed-by: Marc Zyngier <[email protected]> Link: https://lore.kernel.org/r/[email protected]