Age | Commit message (Collapse) | Author | Files | Lines |
|
The bit 48 in the PERF_GLOBAL_STATUS is used to indicate the overflow
status of the PERF_METRICS counters.
Move the BTS index to the bit 47.
Signed-off-by: Kan Liang <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]
|
|
The fourth fixed counter, TOPDOWN.SLOTS, is introduced in Ice Lake to
measure the level 1 TopDown events.
Add MSR address and macros for the new fixed counter, which will be used
in a later patch.
Add comments to explain the event encoding rules for the fixed counters.
Signed-off-by: Kan Liang <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]
|
|
Magic numbers are used in the current NMI handler for the global status
bit. Use a meaningful name to replace the magic numbers to improve the
readability of the code.
Remove a Tab for all GLOBAL_STATUS_* and INTEL_PMC_IDX_FIXED_BTS macros
to reduce the length of the line.
Suggested-by: Peter Zijlstra <[email protected]>
Signed-off-by: Kan Liang <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]
|
|
The RDPMC index is always re-calculated for the RDPMC userspace support,
which is unnecessary.
The RDPMC index value is stored in the variable event_base_rdpmc for
the kernel usage, which can be used for RDPMC userspace support as well.
Suggested-by: Peter Zijlstra <[email protected]>
Signed-off-by: Kan Liang <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]
|
|
Current minimum required version of binutils is 2.23, which supports
XGETBV and XSETBV instruction mnemonics.
Replace the byte-wise specification of XGETBV and XSETBV with these
proper mnemonics.
Signed-off-by: Uros Bizjak <[email protected]>
Signed-off-by: Borislav Petkov <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]
|
|
See the SDM, volume 3, section 4.4.1:
If PAE paging would be in use following an execution of MOV to CR0 or
MOV to CR4 (see Section 4.1.1) and the instruction is modifying any of
CR0.CD, CR0.NW, CR0.PG, CR4.PAE, CR4.PGE, CR4.PSE, or CR4.SMEP; then
the PDPTEs are loaded from the address in CR3.
Fixes: b9baba8614890 ("KVM, pkeys: expose CPUID/CR4 to guest")
Cc: Huaitong Han <[email protected]>
Signed-off-by: Jim Mattson <[email protected]>
Reviewed-by: Peter Shier <[email protected]>
Reviewed-by: Oliver Upton <[email protected]>
Message-Id: <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
|
|
See the SDM, volume 3, section 4.4.1:
If PAE paging would be in use following an execution of MOV to CR0 or
MOV to CR4 (see Section 4.1.1) and the instruction is modifying any of
CR0.CD, CR0.NW, CR0.PG, CR4.PAE, CR4.PGE, CR4.PSE, or CR4.SMEP; then
the PDPTEs are loaded from the address in CR3.
Fixes: 0be0226f07d14 ("KVM: MMU: fix SMAP virtualization")
Cc: Xiao Guangrong <[email protected]>
Signed-off-by: Jim Mattson <[email protected]>
Reviewed-by: Peter Shier <[email protected]>
Reviewed-by: Oliver Upton <[email protected]>
Message-Id: <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
|
|
The PK bit of the error code is computed dynamically in permission_fault
and therefore need not be passed to gva_to_gpa: only the access bits
(fetch, user, write) need to be passed down.
Not doing so causes a splat in the pku test:
WARNING: CPU: 25 PID: 5465 at arch/x86/kvm/mmu.h:197 paging64_walk_addr_generic+0x594/0x750 [kvm]
Hardware name: Intel Corporation WilsonCity/WilsonCity, BIOS WLYDCRB1.SYS.0014.D62.2001092233 01/09/2020
RIP: 0010:paging64_walk_addr_generic+0x594/0x750 [kvm]
Code: <0f> 0b e9 db fe ff ff 44 8b 43 04 4c 89 6c 24 30 8b 13 41 39 d0 89
RSP: 0018:ff53778fc623fb60 EFLAGS: 00010202
RAX: 0000000000000001 RBX: ff53778fc623fbf0 RCX: 0000000000000007
RDX: 0000000000000001 RSI: 0000000000000002 RDI: ff4501efba818000
RBP: 0000000000000020 R08: 0000000000000005 R09: 00000000004000e7
R10: 0000000000000001 R11: 0000000000000000 R12: 0000000000000007
R13: ff4501efba818388 R14: 10000000004000e7 R15: 0000000000000000
FS: 00007f2dcf31a700(0000) GS:ff4501f1c8040000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000000 CR3: 0000001dea475005 CR4: 0000000000763ee0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
PKRU: 55555554
Call Trace:
paging64_gva_to_gpa+0x3f/0xb0 [kvm]
kvm_fixup_and_inject_pf_error+0x48/0xa0 [kvm]
handle_exception_nmi+0x4fc/0x5b0 [kvm_intel]
kvm_arch_vcpu_ioctl_run+0x911/0x1c10 [kvm]
kvm_vcpu_ioctl+0x23e/0x5d0 [kvm]
ksys_ioctl+0x92/0xb0
__x64_sys_ioctl+0x16/0x20
do_syscall_64+0x3e/0xb0
entry_SYSCALL_64_after_hwframe+0x44/0xa9
---[ end trace d17eb998aee991da ]---
Reported-by: Sean Christopherson <[email protected]>
Fixes: 897861479c064 ("KVM: x86: Add helper functions for illegal GPA checking and page fault injection")
Signed-off-by: Paolo Bonzini <[email protected]>
|
|
The SERIALIZE instruction gives software a way to force the processor to
complete all modifications to flags, registers and memory from previous
instructions and drain all buffered writes to memory before the next
instruction is fetched and executed. Thus, it serves the purpose of
sync_core(). Use it when available.
Suggested-by: Andy Lutomirski <[email protected]>
Signed-off-by: Ricardo Neri <[email protected]>
Signed-off-by: Borislav Petkov <[email protected]>
Reviewed-by: Tony Luck <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]
|
|
Linux only has support to read total DDR reads and writes. Here we
add support to enable bandwidth breakdown-GT, IA and IO. Breakdown
of BW is important to debug and optimize memory access. This can also
be used for telemetry and improving the system software.The offsets for
GT, IA and IO are added and these free running counters can be accessed
via MMIO space.
The BW breakdown can be measured using the following cmd:
perf stat -e uncore_imc/gt_requests/,uncore_imc/ia_requests/,uncore_imc/io_requests/
30.57 MiB uncore_imc/gt_requests/
1346.13 MiB uncore_imc/ia_requests/
190.97 MiB uncore_imc/io_requests/
5.984572733 seconds time elapsed
BW/s = <gt,ia,io>_requests/time elapsed
Signed-off-by: Vaibhav Shankar <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 fixes from Ingo Molnar:
"Misc fixes and small updates all around the place:
- Fix mitigation state sysfs output
- Fix an FPU xstate/sxave code assumption bug triggered by
Architectural LBR support
- Fix Lightning Mountain SoC TSC frequency enumeration bug
- Fix kexec debug output
- Fix kexec memory range assumption bug
- Fix a boundary condition in the crash kernel code
- Optimize porgatory.ro generation a bit
- Enable ACRN guests to use X2APIC mode
- Reduce a __text_poke() IRQs-off critical section for the benefit of
PREEMPT_RT"
* tag 'x86-urgent-2020-08-15' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/alternatives: Acquire pte lock with interrupts enabled
x86/bugs/multihit: Fix mitigation reporting when VMX is not in use
x86/fpu/xstate: Fix an xstate size check warning with architectural LBRs
x86/purgatory: Don't generate debug info for purgatory.ro
x86/tsr: Fix tsc frequency enumeration bug on Lightning Mountain SoC
kexec_file: Correctly output debugging information for the PT_LOAD ELF header
kexec: Improve & fix crash_exclude_mem_range() to handle overlapping ranges
x86/crash: Correct the address boundary of function parameters
x86/acrn: Remove redundant chars from ACRN signature
x86/acrn: Allow ACRN guest to use X2APIC mode
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull perf fixes from Ingo Molnar:
"Misc fixes, an expansion of perf syscall access to CAP_PERFMON
privileged tools, plus a RAPL HW-enablement for Intel SPR platforms"
* tag 'perf-urgent-2020-08-15' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
perf/x86/rapl: Add support for Intel SPR platform
perf/x86/rapl: Support multiple RAPL unit quirks
perf/x86/rapl: Fix missing psys sysfs attributes
hw_breakpoint: Remove unused __register_perf_hw_breakpoint() declaration
kprobes: Remove show_registers() function prototype
perf/core: Take over CAP_SYS_PTRACE creds to CAP_PERFMON capability
|
|
The comment explaining why 4-level systems only need to allocate on
the P4D level caused some confustion. Update it to better explain why
on 4-level systems the allocation on PUD level is necessary.
Signed-off-by: Joerg Roedel <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
|
|
Remove the code to sync the vmalloc and ioremap ranges for x86-64. The
page-table pages are all pre-allocated so that synchronization is
no longer necessary.
This is a patch that already went into the kernel as:
commit 8bb9bf242d1f ("x86/mm/64: Do not sync vmalloc/ioremap mappings")
But it had to be reverted later because it unveiled a bug from:
commit 6eb82f994026 ("x86/mm: Pre-allocate P4D/PUD pages for vmalloc area")
The bug in that commit causes the P4D/PUD pages not to be correctly
allocated, making the synchronization still necessary. That issue got
fixed meanwhile upstream:
commit 995909a4e22b ("x86/mm/64: Do not dereference non-present PGD entries")
With that fix it is safe again to remove the page-table synchronization
for vmalloc/ioremap ranges on x86-64.
Signed-off-by: Joerg Roedel <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
|
|
pte_clear() et al are based on two paravirt steps today: one step to
create a page table entry with all zeroes, and one step to write this
entry value.
Drop the first step as it is completely useless.
Signed-off-by: Juergen Gross <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
|
|
On x86 set_pte_at() is now always falling back to set_pte(). So instead
of having this fallback after the paravirt maze just drop the
set_pte_at paravirt operation and let set_pte_at() use the set_pte()
function directly.
Signed-off-by: Juergen Gross <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
|
|
With 32-bit Xen PV support gone, the following commit is not needed
anymore:
a4c0e91d1d65bc58 ("x86/entry/32: Fix XEN_PV build dependency")
Signed-off-by: Juergen Gross <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
|
|
There are some code parts using CONFIG_PARAVIRT for Xen pvops related
issues instead of the more stringent CONFIG_PARAVIRT_XXL.
Signed-off-by: Juergen Gross <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
|
|
Some paravirt macros are no longer used, delete them.
Signed-off-by: Juergen Gross <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
|
|
The last 32-bit user of stuff under CONFIG_PARAVIRT_XXL is gone.
Remove 32-bit specific parts.
Signed-off-by: Juergen Gross <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
|
|
Since commit 61a47c1ad3a4dc ("sysctl: Remove the sysctl system call"),
sys_sysctl is actually unavailable: any input can only return an error.
We have been warning about people using the sysctl system call for years
and believe there are no more users. Even if there are users of this
interface if they have not complained or fixed their code by now they
probably are not going to, so there is no point in warning them any
longer.
So completely remove sys_sysctl on all architectures.
[[email protected]: s390: fix build error for sys_call_table_emu]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Xiaoming Ni <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Acked-by: Will Deacon <[email protected]> [arm/arm64]
Acked-by: "Eric W. Biederman" <[email protected]>
Cc: Aleksa Sarai <[email protected]>
Cc: Alexander Shishkin <[email protected]>
Cc: Al Viro <[email protected]>
Cc: Andi Kleen <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Arnaldo Carvalho de Melo <[email protected]>
Cc: Arnd Bergmann <[email protected]>
Cc: Benjamin Herrenschmidt <[email protected]>
Cc: Bin Meng <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Brian Gerst <[email protected]>
Cc: Catalin Marinas <[email protected]>
Cc: chenzefeng <[email protected]>
Cc: Christian Borntraeger <[email protected]>
Cc: Christian Brauner <[email protected]>
Cc: Chris Zankel <[email protected]>
Cc: David Howells <[email protected]>
Cc: David S. Miller <[email protected]>
Cc: Diego Elio Pettenò <[email protected]>
Cc: Dmitry Vyukov <[email protected]>
Cc: Dominik Brodowski <[email protected]>
Cc: Fenghua Yu <[email protected]>
Cc: Geert Uytterhoeven <[email protected]>
Cc: Heiko Carstens <[email protected]>
Cc: Helge Deller <[email protected]>
Cc: "H. Peter Anvin" <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Iurii Zaikin <[email protected]>
Cc: Ivan Kokshaysky <[email protected]>
Cc: James Bottomley <[email protected]>
Cc: Jens Axboe <[email protected]>
Cc: Jiri Olsa <[email protected]>
Cc: Kars de Jong <[email protected]>
Cc: Kees Cook <[email protected]>
Cc: Krzysztof Kozlowski <[email protected]>
Cc: Luis Chamberlain <[email protected]>
Cc: Marco Elver <[email protected]>
Cc: Mark Rutland <[email protected]>
Cc: Martin K. Petersen <[email protected]>
Cc: Masahiro Yamada <[email protected]>
Cc: Matt Turner <[email protected]>
Cc: Max Filippov <[email protected]>
Cc: Michael Ellerman <[email protected]>
Cc: Michal Simek <[email protected]>
Cc: Miklos Szeredi <[email protected]>
Cc: Minchan Kim <[email protected]>
Cc: Namhyung Kim <[email protected]>
Cc: Naveen N. Rao <[email protected]>
Cc: Nick Piggin <[email protected]>
Cc: Oleg Nesterov <[email protected]>
Cc: Olof Johansson <[email protected]>
Cc: Paul Burton <[email protected]>
Cc: "Paul E. McKenney" <[email protected]>
Cc: Paul Mackerras <[email protected]>
Cc: Peter Zijlstra (Intel) <[email protected]>
Cc: Randy Dunlap <[email protected]>
Cc: Ravi Bangoria <[email protected]>
Cc: Richard Henderson <[email protected]>
Cc: Rich Felker <[email protected]>
Cc: Russell King <[email protected]>
Cc: Sami Tolvanen <[email protected]>
Cc: Sargun Dhillon <[email protected]>
Cc: Stephen Rothwell <[email protected]>
Cc: Sudeep Holla <[email protected]>
Cc: Sven Schnelle <[email protected]>
Cc: Thiago Jung Bauermann <[email protected]>
Cc: Thomas Bogendoerfer <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Tony Luck <[email protected]>
Cc: Vasily Gorbik <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: Yoshinori Sato <[email protected]>
Cc: Zhou Yanjie <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Linus Torvalds <[email protected]>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull timekeeping updates from Thomas Gleixner:
"A set of timekeeping/VDSO updates:
- Preparatory work to allow S390 to switch over to the generic VDSO
implementation.
S390 requires that the VDSO data pointer is handed in to the
counter read function when time namespace support is enabled.
Adding the pointer is a NOOP for all other architectures because
the compiler is supposed to optimize that out when it is unused in
the architecture specific inline. The change also solved a similar
problem for MIPS which fortunately has time namespaces not yet
enabled.
S390 needs to update clock related VDSO data independent of the
timekeeping updates. This was solved so far with yet another
sequence counter in the S390 implementation. A better solution is
to utilize the already existing VDSO sequence count for this. The
core code now exposes helper functions which allow to serialize
against the timekeeper code and against concurrent readers.
S390 needs extra data for their clock readout function. The initial
common VDSO data structure did not provide a way to add that. It
now has an embedded architecture specific struct embedded which
defaults to an empty struct.
Doing this now avoids tree dependencies and conflicts post rc1 and
allows all other architectures which work on generic VDSO support
to work from a common upstream base.
- A trivial comment fix"
* tag 'timers-urgent-2020-08-14' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
time: Delete repeated words in comments
lib/vdso: Allow to add architecture-specific vdso data
timekeeping/vsyscall: Provide vdso_update_begin/end()
vdso/treewide: Add vdso_data pointer argument to __arch_get_hw_counter()
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull more timer updates from Thomas Gleixner:
"A set of posix CPU timer changes which allows to defer the heavy work
of posix CPU timers into task work context. The tick interrupt is
reduced to a quick check which queues the work which is doing the
heavy lifting before returning to user space or going back to guest
mode. Moving this out is deferring the signal delivery slightly but
posix CPU timers are inaccurate by nature as they depend on the tick
so there is no real damage. The relevant test cases all passed.
This lifts the last offender for RT out of the hard interrupt context
tick handler, but it also has the general benefit that the actual
heavy work is accounted to the task/process and not to the tick
interrupt itself.
Further optimizations are possible to break long sighand lock hold and
interrupt disabled (on !RT kernels) times when a massive amount of
posix CPU timers (which are unpriviledged) is armed for a
task/process.
This is currently only enabled for x86 because the architecture has to
ensure that task work is handled in KVM before entering a guest, which
was just established for x86 with the new common entry/exit code which
got merged post 5.8 and is not the case for other KVM architectures"
* tag 'timers-core-2020-08-14' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86: Select POSIX_CPU_TIMERS_TASK_WORK
posix-cpu-timers: Provide mechanisms to defer timer handling to task_work
posix-cpu-timers: Split run_posix_cpu_timers()
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip
Pull more xen updates from Juergen Gross:
- Remove support for running as 32-bit Xen PV-guest.
32-bit PV guests are rarely used, are lacking security fixes for
Meltdown, and can be easily replaced by PVH mode. Another series for
doing more cleanup will follow soon (removal of 32-bit-only pvops
functionality).
- Fixes and additional features for the Xen display frontend driver.
* tag 'for-linus-5.9-rc1b-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip:
drm/xen-front: Pass dumb buffer data offset to the backend
xen: Sync up with the canonical protocol definition in Xen
drm/xen-front: Add YUYV to supported formats
drm/xen-front: Fix misused IS_ERR_OR_NULL checks
xen/gntdev: Fix dmabuf import with non-zero sgt offset
x86/xen: drop tests for highmem in pv code
x86/xen: eliminate xen-asm_64.S
x86/xen: remove 32-bit Xen PV guest support
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/hyperv/linux
Pull hyper-v fixes from Wei Liu:
- fix oops reporting on Hyper-V
- make objtool happy
* tag 'hyperv-fixes-signed' of git://git.kernel.org/pub/scm/linux/kernel/git/hyperv/linux:
x86/hyperv: Make hv_setup_sched_clock inline
Drivers: hv: vmbus: Only notify Hyper-V for die events that are oops
|
|
syzbot found its way in 86_fsgsbase_read_task() and triggered this oops:
KASAN: null-ptr-deref in range [0x0000000000000008-0x000000000000000f]
CPU: 0 PID: 6866 Comm: syz-executor262 Not tainted 5.8.0-syzkaller #0
RIP: 0010:x86_fsgsbase_read_task+0x16d/0x310 arch/x86/kernel/process_64.c:393
Call Trace:
putreg32+0x3ab/0x530 arch/x86/kernel/ptrace.c:876
genregs32_set arch/x86/kernel/ptrace.c:1026 [inline]
genregs32_set+0xa4/0x100 arch/x86/kernel/ptrace.c:1006
copy_regset_from_user include/linux/regset.h:326 [inline]
ia32_arch_ptrace arch/x86/kernel/ptrace.c:1061 [inline]
compat_arch_ptrace+0x36c/0xd90 arch/x86/kernel/ptrace.c:1198
__do_compat_sys_ptrace kernel/ptrace.c:1420 [inline]
__se_compat_sys_ptrace kernel/ptrace.c:1389 [inline]
__ia32_compat_sys_ptrace+0x220/0x2f0 kernel/ptrace.c:1389
do_syscall_32_irqs_on arch/x86/entry/common.c:84 [inline]
__do_fast_syscall_32+0x57/0x80 arch/x86/entry/common.c:126
do_fast_syscall_32+0x2f/0x70 arch/x86/entry/common.c:149
entry_SYSENTER_compat_after_hwframe+0x4d/0x5c
This can happen if ptrace() or sigreturn() pokes an LDT selector into FS
or GS for a task with no LDT and something tries to read the base before
a return to usermode notices the bad selector and fixes it.
The fix is to make sure ldt pointer is not NULL.
Fixes: 07e1d88adaae ("x86/fsgsbase/64: Fix ptrace() to read the FS/GS base accurately")
Co-developed-by: Jann Horn <[email protected]>
Signed-off-by: Eric Dumazet <[email protected]>
Reported-by: syzbot <[email protected]>
Acked-by: Andy Lutomirski <[email protected]>
Cc: Chang S. Bae <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Brian Gerst <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Denys Vlasenko <[email protected]>
Cc: H. Peter Anvin <[email protected]>
Cc: Markus T Metzger <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Ravi Shankar <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Ingo Molnar <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Add a linker script check that there are no run-time relocations, and
remove the old one that tries to check via looking for specially-named
sections in the object files.
Drop the tests for -fPIE compiler option and -pie linker option, as they
are available in all supported gcc and binutils versions (as well as
clang and lld).
Signed-off-by: Arvind Sankar <[email protected]>
Signed-off-by: Kees Cook <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
Tested-by: Nick Desaulniers <[email protected]>
Tested-by: Sedat Dilek <[email protected]>
Reviewed-by: Kees Cook <[email protected]>
Reviewed-by: Ard Biesheuvel <[email protected]>
Reviewed-by: Fangrui Song <[email protected]>
Reviewed-by: Sedat Dilek <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
|
|
The BFD linker generates run-time relocations for z_input_len and
z_output_len, even though they are absolute symbols.
This is fixed for binutils-2.35 [1]. Work around this for earlier
versions by defining two variables input_len and output_len in addition
to the symbols, and use them via position-independent references.
This eliminates the last two run-time relocations in the head code and
allows us to drop the -z noreloc-overflow flag to the linker.
Move the -pie and --no-dynamic-linker LDFLAGS to LDFLAGS_vmlinux instead
of KBUILD_LDFLAGS. There shouldn't be anything else getting linked, but
this is the more logical location for these flags, and modversions might
call the linker if an EXPORT_SYMBOL is left over accidentally in one of
the decompressors.
[1] https://sourceware.org/bugzilla/show_bug.cgi?id=25754
Signed-off-by: Arvind Sankar <[email protected]>
Signed-off-by: Kees Cook <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
Tested-by: Nick Desaulniers <[email protected]>
Tested-by: Sedat Dilek <[email protected]>
Reviewed-by: Kees Cook <[email protected]>
Reviewed-by: Ard Biesheuvel <[email protected]>
Reviewed-by: Fangrui Song <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
|
|
The assembly code in head_{32,64}.S, while meant to be
position-independent, generates run-time relocations because it uses
instructions such as:
leal gdt(%edx), %eax
which make the assembler and linker think that the code is using %edx as
an index into gdt, and hence gdt needs to be relocated to its run-time
address.
On 32-bit, with lld Dmitry Golovin reports that this results in a
link-time error with default options (i.e. unless -z notext is
explicitly passed):
LD arch/x86/boot/compressed/vmlinux
ld.lld: error: can't create dynamic relocation R_386_32 against local
symbol in readonly segment; recompile object files with -fPIC or pass
'-Wl,-z,notext' to allow text relocations in the output
With the BFD linker, this generates a warning during the build, if
--warn-shared-textrel is enabled, which at least Gentoo enables by
default:
LD arch/x86/boot/compressed/vmlinux
ld: arch/x86/boot/compressed/head_32.o: warning: relocation in read-only section `.head.text'
ld: warning: creating a DT_TEXTREL in object
On 64-bit, it is not possible to link the kernel as -pie with lld, and
it is only possible with a BFD linker that supports -z noreloc-overflow,
i.e. versions >2.26. This is because these instructions cannot really be
relocated: the displacement field is only 32-bits wide, and thus cannot
be relocated for a 64-bit load address. The -z noreloc-overflow option
simply overrides the linker error, and results in R_X86_64_RELATIVE
relocations that apply a 64-bit relocation to a 32-bit field anyway.
This happens to work because nothing will process these run-time
relocations.
Start fixing this by removing relocations from .head.text:
- On 32-bit, use a base register that holds the address of the GOT and
reference symbol addresses using @GOTOFF, i.e.
leal gdt@GOTOFF(%edx), %eax
- On 64-bit, most of the code can (and already does) use %rip-relative
addressing, however the .code32 bits can't, and the 64-bit code also
needs to reference symbol addresses as they will be after moving the
compressed kernel to the end of the decompression buffer.
For these cases, reference the symbols as an offset to startup_32 to
avoid creating relocations, i.e.:
leal (gdt-startup_32)(%bp), %eax
This only works in .head.text as the subtraction cannot be represented
as a PC-relative relocation unless startup_32 is in the same section
as the code. Move efi32_pe_entry into .head.text so that it can use
the same method to avoid relocations.
Reported-by: Dmitry Golovin <[email protected]>
Signed-off-by: Arvind Sankar <[email protected]>
Signed-off-by: Kees Cook <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
Tested-by: Nick Desaulniers <[email protected]>
Tested-by: Sedat Dilek <[email protected]>
Reviewed-by: Kees Cook <[email protected]>
Reviewed-by: Ard Biesheuvel <[email protected]>
Reviewed-by: Fangrui Song <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
|
|
GCC puts the main function into .text.startup when compiled with -Os (or
-O2). This results in arch/x86/boot/main.c having a .text.startup
section which is currently not included explicitly in the linker script
setup.ld in the same directory.
The BFD linker places this orphan section immediately after .text, so
this still works. However, LLD git, since [1], is choosing to place it
immediately after the .bstext section instead (this is the first code
section). This plays havoc with the section layout that setup.elf
requires to create the setup header, for eg on 64-bit:
LD arch/x86/boot/setup.elf
ld.lld: error: section .text.startup file range overlaps with .header
>>> .text.startup range is [0x200040, 0x2001FE]
>>> .header range is [0x2001EF, 0x20026B]
ld.lld: error: section .header file range overlaps with .bsdata
>>> .header range is [0x2001EF, 0x20026B]
>>> .bsdata range is [0x2001FF, 0x200398]
ld.lld: error: section .bsdata file range overlaps with .entrytext
>>> .bsdata range is [0x2001FF, 0x200398]
>>> .entrytext range is [0x20026C, 0x2002D3]
ld.lld: error: section .text.startup virtual address range overlaps
with .header
>>> .text.startup range is [0x40, 0x1FE]
>>> .header range is [0x1EF, 0x26B]
ld.lld: error: section .header virtual address range overlaps with
.bsdata
>>> .header range is [0x1EF, 0x26B]
>>> .bsdata range is [0x1FF, 0x398]
ld.lld: error: section .bsdata virtual address range overlaps with
.entrytext
>>> .bsdata range is [0x1FF, 0x398]
>>> .entrytext range is [0x26C, 0x2D3]
ld.lld: error: section .text.startup load address range overlaps with
.header
>>> .text.startup range is [0x40, 0x1FE]
>>> .header range is [0x1EF, 0x26B]
ld.lld: error: section .header load address range overlaps with
.bsdata
>>> .header range is [0x1EF, 0x26B]
>>> .bsdata range is [0x1FF, 0x398]
ld.lld: error: section .bsdata load address range overlaps with
.entrytext
>>> .bsdata range is [0x1FF, 0x398]
>>> .entrytext range is [0x26C, 0x2D3]
Add .text.* to the .text output section to fix this, and also prevent
any future surprises if the compiler decides to create other such
sections.
[1] https://reviews.llvm.org/D75225
Signed-off-by: Arvind Sankar <[email protected]>
Signed-off-by: Kees Cook <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
Tested-by: Nick Desaulniers <[email protected]>
Tested-by: Sedat Dilek <[email protected]>
Reviewed-by: Kees Cook <[email protected]>
Reviewed-by: Ard Biesheuvel <[email protected]>
Reviewed-by: Fangrui Song <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
|
|
In a previous patch, we have eliminated GOT entries from the decompressor
binary and added an assertion that the .got section is empty. This means
that the GOT fixup routines that exist in both the 32-bit and 64-bit
startup routines have become dead code, and can be removed.
While at it, drop the KEEP() from the linker script, as it has no effect
on the contents of output sections that are created by the linker itself.
Signed-off-by: Ard Biesheuvel <[email protected]>
Signed-off-by: Arvind Sankar <[email protected]>
Signed-off-by: Kees Cook <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
Tested-by: Nick Desaulniers <[email protected]>
Tested-by: Sedat Dilek <[email protected]>
Reviewed-by: Kees Cook <[email protected]>
Acked-by: Arvind Sankar <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
|
|
Eliminate all GOT entries in the decompressor binary, by forcing hidden
visibility for all symbol references, which informs the compiler that
such references will be resolved at link time without the need for
allocating GOT entries.
To ensure that no GOT entries will creep back in, add an assertion to
the decompressor linker script that will fire if the .got section has
a non-zero size.
[Arvind: move hidden.h to include/linux instead of making a copy]
Signed-off-by: Ard Biesheuvel <[email protected]>
Signed-off-by: Arvind Sankar <[email protected]>
Signed-off-by: Kees Cook <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
Tested-by: Nick Desaulniers <[email protected]>
Tested-by: Sedat Dilek <[email protected]>
Reviewed-by: Kees Cook <[email protected]>
Acked-by: Arvind Sankar <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
|
|
The .got.plt section contains the part of the GOT which is used by PLT
entries, and which gets updated lazily by the dynamic loader when
function calls are dispatched through those PLT entries.
On fully linked binaries such as the kernel proper or the decompressor,
this never happens, and so in practice, the .got.plt section consists
only of the first 3 magic entries that are meant to point at the _DYNAMIC
section and at the fixup routine in the loader. However, since we don't
use a dynamic loader, those entries are never populated or used.
This means that treating those entries like ordinary GOT entries, and
updating their values based on the actual placement of the executable in
memory is completely pointless, and we can just ignore the .got.plt
section entirely, provided that it has no additional entries beyond
the first 3 ones.
So add an assertion in the linker script to ensure that this assumption
holds, and move the contents out of the [_got, _egot) memory range that
is modified by the GOT fixup routines.
While at it, drop the KEEP(), since it has no effect on the contents
of output sections that are created by the linker itself.
Signed-off-by: Ard Biesheuvel <[email protected]>
Signed-off-by: Arvind Sankar <[email protected]>
Signed-off-by: Kees Cook <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
Tested-by: Sedat Dilek <[email protected]>
Tested-by: Nick Desaulniers <[email protected]>
Reviewed-by: Kees Cook <[email protected]>
Acked-by: Arvind Sankar <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
|
|
Intel SPR platform uses fixed 16 bit energy unit for DRAM RAPL domain,
and fixed 0 bit energy unit for Psys RAPL domain.
After this, on SPR platform the energy counters appear in perf list.
Signed-off-by: Zhang Rui <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
Reviewed-by: Kan Liang <[email protected]>
Acked-by: Len Brown <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
|
|
There will be more platforms with different fixed energy units.
Enhance the code to support different RAPL unit quirks for different
platforms.
Signed-off-by: Zhang Rui <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
Reviewed-by: Kan Liang <[email protected]>
Reviewed-by: Len Brown <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
|
|
This fixes a problem introduced by commit:
5fb5273a905c ("perf/x86/rapl: Use new MSR detection interface")
that perf event sysfs attributes for psys RAPL domain are missing.
Fixes: 5fb5273a905c ("perf/x86/rapl: Use new MSR detection interface")
Signed-off-by: Zhang Rui <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
Reviewed-by: Kan Liang <[email protected]>
Reviewed-by: Len Brown <[email protected]>
Acked-by: Jiri Olsa <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
|
|
pte lock is never acquired in-IRQ context so it does not require interrupts
to be disabled. The lock is a regular spinlock which cannot be acquired
with interrupts disabled on RT.
RT complains about pte_lock() in __text_poke() because it's invoked after
disabling interrupts.
__text_poke() has to disable interrupts as use_temporary_mm() expects
interrupts to be off because it invokes switch_mm_irqs_off() and uses
per-CPU (current active mm) data.
Move the PTE lock handling outside the interrupt disabled region.
Signed-off-by: Sebastian Andrzej Siewior <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Acked-by; Peter Zijlstra (Intel) <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
|
|
Pull more KVM updates from Paolo Bonzini:
"PPC:
- Improvements and bugfixes for secure VM support, giving reduced
startup time and memory hotplug support.
- Locking fixes in nested KVM code
- Increase number of guests supported by HV KVM to 4094
- Preliminary POWER10 support
ARM:
- Split the VHE and nVHE hypervisor code bases, build the EL2 code
separately, allowing for the VHE code to now be built with
instrumentation
- Level-based TLB invalidation support
- Restructure of the vcpu register storage to accomodate the NV code
- Pointer Authentication available for guests on nVHE hosts
- Simplification of the system register table parsing
- MMU cleanups and fixes
- A number of post-32bit cleanups and other fixes
MIPS:
- compilation fixes
x86:
- bugfixes
- support for the SERIALIZE instruction"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (70 commits)
KVM: MIPS/VZ: Fix build error caused by 'kvm_run' cleanup
x86/kvm/hyper-v: Synic default SCONTROL MSR needs to be enabled
MIPS: KVM: Convert a fallthrough comment to fallthrough
MIPS: VZ: Only include loongson_regs.h for CPU_LOONGSON64
x86: Expose SERIALIZE for supported cpuid
KVM: x86: Don't attempt to load PDPTRs when 64-bit mode is enabled
KVM: arm64: Move S1PTW S2 fault logic out of io_mem_abort()
KVM: arm64: Don't skip cache maintenance for read-only memslots
KVM: arm64: Handle data and instruction external aborts the same way
KVM: arm64: Rename kvm_vcpu_dabt_isextabt()
KVM: arm: Add trace name for ARM_NISV
KVM: arm64: Ensure that all nVHE hyp code is in .hyp.text
KVM: arm64: Substitute RANDOMIZE_BASE for HARDEN_EL2_VECTORS
KVM: arm64: Make nVHE ASLR conditional on RANDOMIZE_BASE
KVM: PPC: Book3S HV: Rework secure mem slot dropping
KVM: PPC: Book3S HV: Move kvmppc_svm_page_out up
KVM: PPC: Book3S HV: Migrate hot plugged memory
KVM: PPC: Book3S HV: In H_SVM_INIT_DONE, migrate remaining normal-GFNs to secure-GFNs
KVM: PPC: Book3S HV: Track the state GFNs associated with secure VMs
KVM: PPC: Book3S HV: Disable page merging in H_SVM_INIT_START
...
|
|
Merge more updates from Andrew Morton:
- most of the rest of MM (memcg, hugetlb, vmscan, proc, compaction,
mempolicy, oom-kill, hugetlbfs, migration, thp, cma, util,
memory-hotplug, cleanups, uaccess, migration, gup, pagemap),
- various other subsystems (alpha, misc, sparse, bitmap, lib, bitops,
checkpatch, autofs, minix, nilfs, ufs, fat, signals, kmod, coredump,
exec, kdump, rapidio, panic, kcov, kgdb, ipc).
* emailed patches from Andrew Morton <[email protected]>: (164 commits)
mm/gup: remove task_struct pointer for all gup code
mm: clean up the last pieces of page fault accountings
mm/xtensa: use general page fault accounting
mm/x86: use general page fault accounting
mm/sparc64: use general page fault accounting
mm/sparc32: use general page fault accounting
mm/sh: use general page fault accounting
mm/s390: use general page fault accounting
mm/riscv: use general page fault accounting
mm/powerpc: use general page fault accounting
mm/parisc: use general page fault accounting
mm/openrisc: use general page fault accounting
mm/nios2: use general page fault accounting
mm/nds32: use general page fault accounting
mm/mips: use general page fault accounting
mm/microblaze: use general page fault accounting
mm/m68k: use general page fault accounting
mm/ia64: use general page fault accounting
mm/hexagon: use general page fault accounting
mm/csky: use general page fault accounting
...
|
|
Use the general page fault accounting by passing regs into
handle_mm_fault().
Signed-off-by: Peter Xu <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: H. Peter Anvin <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Patch series "mm: Page fault accounting cleanups", v5.
This is v5 of the pf accounting cleanup series. It originates from Gerald
Schaefer's report on an issue a week ago regarding to incorrect page fault
accountings for retried page fault after commit 4064b9827063 ("mm: allow
VM_FAULT_RETRY for multiple times"):
https://lore.kernel.org/lkml/20200610174811.44b94525@thinkpad/
What this series did:
- Correct page fault accounting: we do accounting for a page fault
(no matter whether it's from #PF handling, or gup, or anything else)
only with the one that completed the fault. For example, page fault
retries should not be counted in page fault counters. Same to the
perf events.
- Unify definition of PERF_COUNT_SW_PAGE_FAULTS: currently this perf
event is used in an adhoc way across different archs.
Case (1): for many archs it's done at the entry of a page fault
handler, so that it will also cover e.g. errornous faults.
Case (2): for some other archs, it is only accounted when the page
fault is resolved successfully.
Case (3): there're still quite some archs that have not enabled
this perf event.
Since this series will touch merely all the archs, we unify this
perf event to always follow case (1), which is the one that makes most
sense. And since we moved the accounting into handle_mm_fault, the
other two MAJ/MIN perf events are well taken care of naturally.
- Unify definition of "major faults": the definition of "major
fault" is slightly changed when used in accounting (not
VM_FAULT_MAJOR). More information in patch 1.
- Always account the page fault onto the one that triggered the page
fault. This does not matter much for #PF handlings, but mostly for
gup. More information on this in patch 25.
Patchset layout:
Patch 1: Introduced the accounting in handle_mm_fault(), not enabled.
Patch 2-23: Enable the new accounting for arch #PF handlers one by one.
Patch 24: Enable the new accounting for the rest outliers (gup, iommu, etc.)
Patch 25: Cleanup GUP task_struct pointer since it's not needed any more
This patch (of 25):
This is a preparation patch to move page fault accountings into the
general code in handle_mm_fault(). This includes both the per task
flt_maj/flt_min counters, and the major/minor page fault perf events. To
do this, the pt_regs pointer is passed into handle_mm_fault().
PERF_COUNT_SW_PAGE_FAULTS should still be kept in per-arch page fault
handlers.
So far, all the pt_regs pointer that passed into handle_mm_fault() is
NULL, which means this patch should have no intented functional change.
Suggested-by: Linus Torvalds <[email protected]>
Signed-off-by: Peter Xu <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Cc: Albert Ou <[email protected]>
Cc: Alexander Gordeev <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Benjamin Herrenschmidt <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Brian Cain <[email protected]>
Cc: Catalin Marinas <[email protected]>
Cc: Christian Borntraeger <[email protected]>
Cc: Chris Zankel <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: David S. Miller <[email protected]>
Cc: Geert Uytterhoeven <[email protected]>
Cc: Gerald Schaefer <[email protected]>
Cc: Greentime Hu <[email protected]>
Cc: Guo Ren <[email protected]>
Cc: Heiko Carstens <[email protected]>
Cc: Helge Deller <[email protected]>
Cc: H. Peter Anvin <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Ivan Kokshaysky <[email protected]>
Cc: James E.J. Bottomley <[email protected]>
Cc: John Hubbard <[email protected]>
Cc: Jonas Bonn <[email protected]>
Cc: Ley Foon Tan <[email protected]>
Cc: "Luck, Tony" <[email protected]>
Cc: Matt Turner <[email protected]>
Cc: Max Filippov <[email protected]>
Cc: Michael Ellerman <[email protected]>
Cc: Michal Simek <[email protected]>
Cc: Nick Hu <[email protected]>
Cc: Palmer Dabbelt <[email protected]>
Cc: Paul Mackerras <[email protected]>
Cc: Paul Walmsley <[email protected]>
Cc: Pekka Enberg <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Richard Henderson <[email protected]>
Cc: Rich Felker <[email protected]>
Cc: Russell King <[email protected]>
Cc: Stafford Horne <[email protected]>
Cc: Stefan Kristiansson <[email protected]>
Cc: Thomas Bogendoerfer <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Vasily Gorbik <[email protected]>
Cc: Vincent Chen <[email protected]>
Cc: Vineet Gupta <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: Yoshinori Sato <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Linus Torvalds <[email protected]>
|
|
segment_eq is only used to implement uaccess_kernel. Just open code
uaccess_kernel in the arch uaccess headers and remove one layer of
indirection.
Signed-off-by: Christoph Hellwig <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Acked-by: Linus Torvalds <[email protected]>
Acked-by: Greentime Hu <[email protected]>
Acked-by: Geert Uytterhoeven <[email protected]>
Cc: Nick Hu <[email protected]>
Cc: Vincent Chen <[email protected]>
Cc: Paul Walmsley <[email protected]>
Cc: Palmer Dabbelt <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Linus Torvalds <[email protected]>
|
|
This is to introduce a general dummy helper. memory_add_physaddr_to_nid()
is a fallback option to get the nid in case NUMA_NO_NID is detected.
After this patch, arm64/sh/s390 can simply use the general dummy version.
PowerPC/x86/ia64 will still use their specific version.
This is the preparation to set a fallback value for dev_dax->target_node.
Signed-off-by: Jia He <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Reviewed-by: David Hildenbrand <[email protected]>
Cc: Dan Williams <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Catalin Marinas <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: Tony Luck <[email protected]>
Cc: Fenghua Yu <[email protected]>
Cc: Yoshinori Sato <[email protected]>
Cc: Rich Felker <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: "H. Peter Anvin" <[email protected]>
Cc: Vishal Verma <[email protected]>
Cc: Dave Jiang <[email protected]>
Cc: Baoquan He <[email protected]>
Cc: Chuhong Yuan <[email protected]>
Cc: Mike Rapoport <[email protected]>
Cc: Logan Gunthorpe <[email protected]>
Cc: Masahiro Yamada <[email protected]>
Cc: Jonathan Cameron <[email protected]>
Cc: Kaly Xin <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Some of our servers spend significant time at kernel boot initializing
memory block sysfs directories and then creating symlinks between them and
the corresponding nodes. The slowness happens because the machines get
stuck with the smallest supported memory block size on x86 (128M), which
results in 16,288 directories to cover the 2T of installed RAM. The
search for each memory block is noticeable even with commit 4fb6eabf1037
("drivers/base/memory.c: cache memory blocks in xarray to accelerate
lookup").
Commit 078eb6aa50dc ("x86/mm/memory_hotplug: determine block size based on
the end of boot memory") chooses the block size based on alignment with
memory end. That addresses hotplug failures in qemu guests, but for bare
metal systems whose memory end isn't aligned to even the smallest size, it
leaves them at 128M.
Make kernels that aren't running on a hypervisor use the largest supported
size (2G) to minimize overhead on big machines. Kernel boot goes 7%
faster on the aforementioned servers, shaving off half a second.
[[email protected]: v3]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Daniel Jordan <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Acked-by: David Hildenbrand <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Pavel Tatashin <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Steven Sistare <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: "H. Peter Anvin" <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Pull virtio updates from Michael Tsirkin:
- IRQ bypass support for vdpa and IFC
- MLX5 vdpa driver
- Endianness fixes for virtio drivers
- Misc other fixes
* tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost: (71 commits)
vdpa/mlx5: fix up endian-ness for mtu
vdpa: Fix pointer math bug in vdpasim_get_config()
vdpa/mlx5: Fix pointer math in mlx5_vdpa_get_config()
vdpa/mlx5: fix memory allocation failure checks
vdpa/mlx5: Fix uninitialised variable in core/mr.c
vdpa_sim: init iommu lock
virtio_config: fix up warnings on parisc
vdpa/mlx5: Add VDPA driver for supported mlx5 devices
vdpa/mlx5: Add shared memory registration code
vdpa/mlx5: Add support library for mlx5 VDPA implementation
vdpa/mlx5: Add hardware descriptive header file
vdpa: Modify get_vq_state() to return error code
net/vdpa: Use struct for set/get vq state
vdpa: remove hard coded virtq num
vdpasim: support batch updating
vhost-vdpa: support IOTLB batching hints
vhost-vdpa: support get/set backend features
vhost: generialize backend features setting/getting
vhost-vdpa: refine ioctl pre-processing
vDPA: dont change vq irq after DRIVER_OK
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu
Pull iommu updates from Joerg Roedel:
- Remove of the dev->archdata.iommu (or similar) pointers from most
architectures. Only Sparc is left, but this is private to Sparc as
their drivers don't use the IOMMU-API.
- ARM-SMMU updates from Will Deacon:
- Support for SMMU-500 implementation in Marvell Armada-AP806 SoC
- Support for SMMU-500 implementation in NVIDIA Tegra194 SoC
- DT compatible string updates
- Remove unused IOMMU_SYS_CACHE_ONLY flag
- Move ARM-SMMU drivers into their own subdirectory
- Intel VT-d updates from Lu Baolu:
- Misc tweaks and fixes for vSVA
- Report/response page request events
- Cleanups
- Move the Kconfig and Makefile bits for the AMD and Intel drivers into
their respective subdirectory.
- MT6779 IOMMU Support
- Support for new chipsets in the Renesas IOMMU driver
- Other misc cleanups and fixes (e.g. to improve compile test coverage)
* tag 'iommu-updates-v5.9' of git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu: (77 commits)
iommu/amd: Move Kconfig and Makefile bits down into amd directory
iommu/vt-d: Move Kconfig and Makefile bits down into intel directory
iommu/arm-smmu: Move Arm SMMU drivers into their own subdirectory
iommu/vt-d: Skip TE disabling on quirky gfx dedicated iommu
iommu: Add gfp parameter to io_pgtable_ops->map()
iommu: Mark __iommu_map_sg() as static
iommu/vt-d: Rename intel-pasid.h to pasid.h
iommu/vt-d: Add page response ops support
iommu/vt-d: Report page request faults for guest SVA
iommu/vt-d: Add a helper to get svm and sdev for pasid
iommu/vt-d: Refactor device_to_iommu() helper
iommu/vt-d: Disable multiple GPASID-dev bind
iommu/vt-d: Warn on out-of-range invalidation address
iommu/vt-d: Fix devTLB flush for vSVA
iommu/vt-d: Handle non-page aligned address
iommu/vt-d: Fix PASID devTLB invalidation
iommu/vt-d: Remove global page support in devTLB flush
iommu/vt-d: Enforce PASID devTLB field mask
iommu: Make some functions static
iommu/amd: Remove double zero check
...
|
|
Make hv_setup_sched_clock inline so the reference to pv_ops works
correctly with objtool updates to detect noinstr violations.
See https://lore.kernel.org/patchwork/patch/1283635/
Signed-off-by: Michael Kelley <[email protected]>
Acked-by: Peter Zijlstra (Intel) <[email protected]>
Acked-by: Thomas Gleixner <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Wei Liu <[email protected]>
|
|
With support for 32-bit pv guests gone pure pv-code no longer needs to
test for highmem. Dropping those tests removes the need for flushing
in some places.
Signed-off-by: Juergen Gross <[email protected]>
Reviewed-by: Boris Ostrovsky <[email protected]>
Signed-off-by: Juergen Gross <[email protected]>
|
|
With 32-bit pv-guest support removed xen-asm_64.S can be merged with
xen-asm.S
Signed-off-by: Juergen Gross <[email protected]>
Reviewed-by: Boris Ostrovsky <[email protected]>
Signed-off-by: Juergen Gross <[email protected]>
|
|
Xen is requiring 64-bit machines today and since Xen 4.14 it can be
built without 32-bit PV guest support. There is no need to carry the
burden of 32-bit PV guest support in the kernel any longer, as new
guests can be either HVM or PVH, or they can use a 64 bit kernel.
Remove the 32-bit Xen PV support from the kernel.
Signed-off-by: Juergen Gross <[email protected]>
Reviewed-by: Boris Ostrovsky <[email protected]>
Signed-off-by: Juergen Gross <[email protected]>
|