Age | Commit message (Collapse) | Author | Files | Lines |
|
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull misc hotfixes from Andrew Morton:
"22 hotfixes.
Eight are cc:stable and the remainder address issues which were
introduced post-6.0 or which aren't considered serious enough to
justify a -stable backport"
* tag 'mm-hotfixes-stable-2022-11-11' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (22 commits)
docs: kmsan: fix formatting of "Example report"
mm/damon/dbgfs: check if rm_contexts input is for a real context
maple_tree: don't set a new maximum on the node when not reusing nodes
maple_tree: fix depth tracking in maple_state
arch/x86/mm/hugetlbpage.c: pud_huge() returns 0 when using 2-level paging
fs: fix leaked psi pressure state
nilfs2: fix use-after-free bug of ns_writer on remount
x86/traps: avoid KMSAN bugs originating from handle_bug()
kmsan: make sure PREEMPT_RT is off
Kconfig.debug: ensure early check for KMSAN in CONFIG_KMSAN_WARN
x86/uaccess: instrument copy_from_user_nmi()
kmsan: core: kmsan_in_runtime() should return true in NMI context
mm: hugetlb_vmemmap: include missing linux/moduleparam.h
mm/shmem: use page_mapping() to detect page cache for uffd continue
mm/memremap.c: map FS_DAX device memory as decrypted
Partly revert "mm/thp: carry over dirty bit when thp splits on pmd"
nilfs2: fix deadlock in nilfs_count_free_blocks()
mm/mmap: fix memory leak in mmap_region()
hugetlbfs: don't delete error page from pagecache
maple_tree: reorganize testing to restore module testing
...
|
|
Commit e5d9b714fe40 ("x86/hyperv: fix root partition faults when writing
to VP assist page MSR") moved 'wrmsrl(HV_X64_MSR_VP_ASSIST_PAGE)' under
'if (*hvp)' condition. This works for root partition as hv_cpu_die()
does memunmap() and sets 'hv_vp_assist_page[cpu]' to NULL but breaks
non-root partitions as hv_cpu_die() doesn't free 'hv_vp_assist_page[cpu]'
for them. This causes VP assist page to remain unset after CPU
offline/online cycle:
$ rdmsr -p 24 0x40000073
10212f001
$ echo 0 > /sys/devices/system/cpu/cpu24/online
$ echo 1 > /sys/devices/system/cpu/cpu24/online
$ rdmsr -p 24 0x40000073
0
Fix the issue by always writing to HV_X64_MSR_VP_ASSIST_PAGE in
hv_cpu_init(). Note, checking 'if (!*hvp)', for root partition is
pointless as hv_cpu_die() always sets 'hv_vp_assist_page[cpu]' to
NULL (and it's also NULL initially).
Note: the fact that 'hv_vp_assist_page[cpu]' is reset to NULL may
present a (potential) issue for KVM. While Hyper-V uses
CPUHP_AP_ONLINE_DYN stage in CPU hotplug, KVM uses CPUHP_AP_KVM_STARTING
which comes earlier in CPU teardown sequence. It is theoretically
possible that Enlightened VMCS is still in use. It is unclear if the
issue is real and if using KVM with Hyper-V root partition is even
possible.
While on it, drop the unneeded smp_processor_id() call from hv_cpu_init().
Fixes: e5d9b714fe40 ("x86/hyperv: fix root partition faults when writing to VP assist page MSR")
Signed-off-by: Vitaly Kuznetsov <[email protected]>
Reviewed-by: Michael Kelley <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Wei Liu <[email protected]>
|
|
On all recent Centaur platforms, ARB_DISABLE is handled by PMU
automatically while entering C3 type state. No need for OS to
issue the ARB_DISABLE, so set bm_control to zero to indicate that.
Signed-off-by: Tony W Wang-oc <[email protected]>
Signed-off-by: Dave Hansen <[email protected]>
Acked-by: Rafael J. Wysocki <[email protected]>
Link: https://lore.kernel.org/all/1667792089-4904-1-git-send-email-TonyWWang-oc%40zhaoxin.com
|
|
Pull kvm
"This is a pretty large diffstat for this time of the release. The main
culprit is a reorganization of the AMD assembly trampoline, allowing
percpu variables to be accessed early.
This is needed for the return stack depth tracking retbleed mitigation
that will be in 6.2, but it also makes it possible to tighten the IBRS
restore on vmexit. The latter change is a long tail of the
spectrev2/retbleed patches (the corresponding Intel change was simpler
and went in already last June), which is why I am including it right
now instead of sharing a topic branch with tip.
Being assembly and being rich in comments makes the line count balloon
a bit, but I am pretty confident in the change (famous last words)
because the reorganization actually makes everything simpler and more
understandable than before. It has also had external review and has
been tested on the aforementioned 6.2 changes, which explode quite
brutally without the fix.
Apart from this, things are pretty normal.
s390:
- PCI fix
- PV clock fix
x86:
- Fix clash between PMU MSRs and other MSRs
- Prepare SVM assembly trampoline for 6.2 retbleed mitigation and
for...
- ... tightening IBRS restore on vmexit, moving it before the first
RET or indirect branch
- Fix log level for VMSA dump
- Block all page faults during kvm_zap_gfn_range()
Tools:
- kvm_stat: fix incorrect detection of debugfs
- kvm_stat: update vmexit definitions"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
KVM: x86/mmu: Block all page faults during kvm_zap_gfn_range()
KVM: x86/pmu: Limit the maximum number of supported AMD GP counters
KVM: x86/pmu: Limit the maximum number of supported Intel GP counters
KVM: x86/pmu: Do not speculatively query Intel GP PMCs that don't exist yet
KVM: SVM: Only dump VMSA to klog at KERN_DEBUG level
tools/kvm_stat: update exit reasons for vmx/svm/aarch64/userspace
tools/kvm_stat: fix incorrect detection of debugfs
x86, KVM: remove unnecessary argument to x86_virt_spec_ctrl and callers
KVM: SVM: move MSR_IA32_SPEC_CTRL save/restore to assembly
KVM: SVM: restore host save area from assembly
KVM: SVM: move guest vmsave/vmload back to assembly
KVM: SVM: do not allocate struct svm_cpu_data dynamically
KVM: SVM: remove dead field from struct svm_cpu_data
KVM: SVM: remove unused field from struct vcpu_svm
KVM: SVM: retrieve VMCB from assembly
KVM: SVM: adjust register allocation for __svm_vcpu_run()
KVM: SVM: replace regs argument of __svm_vcpu_run() with vcpu_svm
KVM: x86: use a separate asm-offsets.c file
KVM: s390: pci: Fix allocation size of aift kzdev elements
KVM: s390: pv: don't allow userspace to set the clock under PV
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/hyperv/linux
Pull hyperv fixes from Wei Liu:
- Fix TSC MSR write for root partition (Anirudh Rayabharam)
- Fix definition of vector in pci-hyperv driver (Dexuan Cui)
- A few other misc patches
* tag 'hyperv-fixes-signed-20221110' of git://git.kernel.org/pub/scm/linux/kernel/git/hyperv/linux:
PCI: hv: Fix the definition of vector in hv_compose_msi_msg()
MAINTAINERS: remove sthemmin
x86/hyperv: fix invalid writes to MSRs during root partition kexec
clocksource/drivers/hyperv: add data structure for reference TSC MSR
Drivers: hv: fix repeated words in comments
x86/hyperv: Remove BUG_ON() for kmap_local_page()
|
|
When zapping a GFN range, pass 0 => ALL_ONES for the to-be-invalidated
range to effectively block all page faults while the zap is in-progress.
The invalidation helpers take a host virtual address, whereas zapping a
GFN obviously provides a guest physical address and with the wrong unit
of measurement (frame vs. byte).
Alternatively, KVM could walk all memslots to get the associated HVAs,
but thanks to SMM, that would require multiple lookups. And practically
speaking, kvm_zap_gfn_range() usage is quite rare and not a hot path,
e.g. MTRR and CR0.CD are almost guaranteed to be done only on vCPU0
during boot, and APICv inhibits are similarly infrequent operations.
Fixes: edb298c663fc ("KVM: x86/mmu: bump mmu notifier count in kvm_zap_gfn_range")
Reported-by: Chao Peng <[email protected]>
Cc: [email protected]
Cc: Maxim Levitsky <[email protected]>
Signed-off-by: Sean Christopherson <[email protected]>
Message-Id: <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
|
|
drivers/net/can/pch_can.c
ae64438be192 ("can: dev: fix skb drop check")
1dd1b521be85 ("can: remove obsolete PCH CAN driver")
https://lore.kernel.org/all/[email protected]/
Signed-off-by: Jakub Kicinski <[email protected]>
|
|
Commit b041b525dab9 ("x86/split_lock: Make life miserable for split lockers")
changed the way the split lock detector works when in "warn" mode;
basically, it not only shows the warn message, but also intentionally
introduces a slowdown through sleeping plus serialization mechanism
on such task. Based on discussions in [0], seems the warning alone
wasn't enough motivation for userspace developers to fix their
applications.
This slowdown is enough to totally break some proprietary (aka.
unfixable) userspace[1].
Happens that originally the proposal in [0] was to add a new mode
which would warns + slowdown the "split locking" task, keeping the
old warn mode untouched. In the end, that idea was discarded and
the regular/default "warn" mode now slows down the applications. This
is quite aggressive with regards proprietary/legacy programs that
basically are unable to properly run in kernel with this change.
While it is understandable that a malicious application could DoS
by split locking, it seems unacceptable to regress old/proprietary
userspace programs through a default configuration that previously
worked. An example of such breakage was reported in [1].
Add a sysctl to allow controlling the "misery mode" behavior, as per
Thomas suggestion on [2]. This way, users running legacy and/or
proprietary software are allowed to still execute them with a decent
performance while still observing the warning messages on kernel log.
[0] https://lore.kernel.org/lkml/[email protected]/
[1] https://github.com/doitsujin/dxvk/issues/2938
[2] https://lore.kernel.org/lkml/87pmf4bter.ffs@tglx/
[ dhansen: minor changelog tweaks, including clarifying the actual
problem ]
Fixes: b041b525dab9 ("x86/split_lock: Make life miserable for split lockers")
Suggested-by: Thomas Gleixner <[email protected]>
Signed-off-by: Guilherme G. Piccoli <[email protected]>
Signed-off-by: Dave Hansen <[email protected]>
Reviewed-by: Tony Luck <[email protected]>
Tested-by: Andre Almeida <[email protected]>
Link: https://lore.kernel.org/all/20221024200254.635256-1-gpiccoli%40igalia.com
|
|
Mike Galbraith reported the following against an old fork of preempt-rt
but the same issue also applies to the current preempt-rt tree.
BUG: sleeping function called from invalid context at kernel/locking/spinlock_rt.c:46
in_atomic(): 1, irqs_disabled(): 0, non_block: 0, pid: 1, name: systemd
preempt_count: 1, expected: 0
RCU nest depth: 0, expected: 0
Preemption disabled at:
fpu_clone
CPU: 6 PID: 1 Comm: systemd Tainted: G E (unreleased)
Call Trace:
<TASK>
dump_stack_lvl
? fpu_clone
__might_resched
rt_spin_lock
fpu_clone
? copy_thread
? copy_process
? shmem_alloc_inode
? kmem_cache_alloc
? kernel_clone
? __do_sys_clone
? do_syscall_64
? __x64_sys_rt_sigprocmask
? syscall_exit_to_user_mode
? do_syscall_64
? syscall_exit_to_user_mode
? do_syscall_64
? syscall_exit_to_user_mode
? do_syscall_64
? exc_page_fault
? entry_SYSCALL_64_after_hwframe
</TASK>
Mike says:
The splat comes from fpu_inherit_perms() being called under fpregs_lock(),
and us reaching the spin_lock_irq() therein due to fpu_state_size_dynamic()
returning true despite static key __fpu_state_size_dynamic having never
been enabled.
Mike's assessment looks correct. fpregs_lock on a PREEMPT_RT kernel disables
preemption so calling spin_lock_irq() in fpu_inherit_perms() is unsafe. This
problem exists since commit
9e798e9aa14c ("x86/fpu: Prepare fpu_clone() for dynamically enabled features").
Even though the original bug report should not have enabled the paths at
all, the bug still exists.
fpregs_lock is necessary when editing the FPU registers or a task's FP
state but it is not necessary for fpu_inherit_perms(). The only write
of any FP state in fpu_inherit_perms() is for the new child which is
not running yet and cannot context switch or be borrowed by a kernel
thread yet. Hence, fpregs_lock is not protecting anything in the new
child until clone() completes and can be dropped earlier. The siglock
still needs to be acquired by fpu_inherit_perms() as the read of the
parent's permissions has to be serialised.
[ bp: Cleanup splat. ]
Fixes: 9e798e9aa14c ("x86/fpu: Prepare fpu_clone() for dynamically enabled features")
Reported-by: Mike Galbraith <[email protected]>
Signed-off-by: Mel Gorman <[email protected]>
Signed-off-by: Borislav Petkov <[email protected]>
Reviewed-by: Thomas Gleixner <[email protected]>
Cc: <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
|
|
Not all architectures like ARM64 need to override the function. Move
its declaration to kvm_dirty_ring.h to avoid the following compiling
warning on ARM64 when the feature is enabled.
arch/arm64/kvm/../../../virt/kvm/dirty_ring.c:14:12: \
warning: no previous prototype for 'kvm_cpu_dirty_log_size' \
[-Wmissing-prototypes] \
int __weak kvm_cpu_dirty_log_size(void)
Reported-by: kernel test robot <[email protected]>
Signed-off-by: Gavin Shan <[email protected]>
Reviewed-by: Peter Xu <[email protected]>
Signed-off-by: Marc Zyngier <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
|
|
The VCPU isn't expected to be runnable when the dirty ring becomes soft
full, until the dirty pages are harvested and the dirty ring is reset
from userspace. So there is a check in each guest's entrace to see if
the dirty ring is soft full or not. The VCPU is stopped from running if
its dirty ring has been soft full. The similar check will be needed when
the feature is going to be supported on ARM64. As Marc Zyngier suggested,
a new event will avoid pointless overhead to check the size of the dirty
ring ('vcpu->kvm->dirty_ring_size') in each guest's entrance.
Add KVM_REQ_DIRTY_RING_SOFT_FULL. The event is raised when the dirty ring
becomes soft full in kvm_dirty_ring_push(). The event is only cleared in
the check, done in the newly added helper kvm_dirty_ring_check_request().
Since the VCPU is not runnable when the dirty ring becomes soft full, the
KVM_REQ_DIRTY_RING_SOFT_FULL event is always set to prevent the VCPU from
running until the dirty pages are harvested and the dirty ring is reset by
userspace.
kvm_dirty_ring_soft_full() becomes a private function with the newly added
helper kvm_dirty_ring_check_request(). The alignment for the various event
definitions in kvm_host.h is changed to tab character by the way. In order
to avoid using 'container_of()', the argument @ring is replaced by @vcpu
in kvm_dirty_ring_push().
Link: https://lore.kernel.org/kvmarm/[email protected]
Suggested-by: Marc Zyngier <[email protected]>
Signed-off-by: Gavin Shan <[email protected]>
Reviewed-by: Peter Xu <[email protected]>
Reviewed-by: Sean Christopherson <[email protected]>
Signed-off-by: Marc Zyngier <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
|
|
The way mtrr_if is initialized with the correct mtrr_ops structure is
quite weird.
Simplify that by dropping the vendor specific init functions and the
mtrr_ops[] array. Replace those with direct assignments of the related
vendor specific ops array to mtrr_if.
Note that a direct assignment is okay even for 64-bit builds, where the
symbol isn't present, as the related code will be subject to "dead code
elimination" due to how cpu_feature_enabled() is implemented.
Signed-off-by: Juergen Gross <[email protected]>
Signed-off-by: Borislav Petkov <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Borislav Petkov <[email protected]>
|
|
Instead of explicitly calling cache_ap_init() in
identify_secondary_cpu() use a CPU hotplug callback instead. By
registering the callback only after having started the non-boot CPUs
and initializing cache_aps_delayed_init with "true", calling
set_cache_aps_delayed_init() at boot time can be dropped.
It should be noted that this change results in cache_ap_init() being
called a little bit later when hotplugging CPUs. By using a new
hotplug slot right at the start of the low level bringup this is not
problematic, as no operations requiring a specific caching mode are
performed that early in CPU initialization.
Suggested-by: Borislav Petkov <[email protected]>
Signed-off-by: Juergen Gross <[email protected]>
Signed-off-by: Borislav Petkov <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Borislav Petkov <[email protected]>
|
|
Today, PAT is usable only with MTRR being active, with some nasty tweaks
to make PAT usable when running as a Xen PV guest which doesn't support
MTRR.
The reason for this coupling is that both PAT MSR changes and MTRR
changes require a similar sequence and so full PAT support was added
using the already available MTRR handling.
Xen PV PAT handling can work without MTRR, as it just needs to consume
the PAT MSR setting done by the hypervisor without the ability and need
to change it. This in turn has resulted in a convoluted initialization
sequence and wrong decisions regarding cache mode availability due to
misguiding PAT availability flags.
Fix all of that by allowing to use PAT without MTRR and by reworking
the current PAT initialization sequence to match better with the newly
introduced generic cache initialization.
This removes the need of the recently added pat_force_disabled flag, so
remove the remnants of the patch adding it.
Signed-off-by: Juergen Gross <[email protected]>
Signed-off-by: Borislav Petkov <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Borislav Petkov <[email protected]>
|
|
Instead of having a stop_machine() handler for either a specific
MTRR register or all state at once, add a handler just for calling
cache_cpu_init() if appropriate.
Add functions for calling stop_machine() with this handler as well.
Add a generic replacement for mtrr_bp_restore() and a wrapper for
mtrr_bp_init().
Signed-off-by: Juergen Gross <[email protected]>
Signed-off-by: Borislav Petkov <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Borislav Petkov <[email protected]>
|
|
In order to prepare decoupling MTRR and PAT replace the MTRR-specific
mtrr_aps_delayed_init flag with a more generic cache_aps_delayed_init
one.
Signed-off-by: Juergen Gross <[email protected]>
Signed-off-by: Borislav Petkov <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Borislav Petkov <[email protected]>
|
|
There is no need for keeping __mtrr_enabled as it can easily be replaced
by testing mtrr_if to be not NULL.
Signed-off-by: Juergen Gross <[email protected]>
Signed-off-by: Borislav Petkov <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Borislav Petkov <[email protected]>
|
|
In case of the generic cache interface being used (Intel CPUs or a
64-bit system), the initialization sequence of the boot CPU is more
complicated than necessary:
- check if MTRR enabled, if yes, call mtrr_bp_pat_init() which will
disable caching, set the PAT MSR, and reenable caching
- call mtrr_cleanup(), in case that changed anything, call
cache_cpu_init() doing the same caching disable/enable dance as
above, but this time with setting the (modified) MTRR state (even
if MTRR was disabled) AND setting the PAT MSR (again even with
disabled MTRR)
The sequence can be simplified a lot while removing potential
inconsistencies:
- check if MTRR enabled, if yes, call mtrr_cleanup() and then
cache_cpu_init()
This ensures to:
- no longer disable/enable caching more than once
- avoid to set MTRRs and/or the PAT MSR on the boot processor in case
of MTRR cleanups even if MTRRs meant to be disabled
With that mtrr_bp_pat_init() can be removed.
Signed-off-by: Juergen Gross <[email protected]>
Signed-off-by: Borislav Petkov <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Borislav Petkov <[email protected]>
|
|
Instead of using an indirect call to mtrr_if->set_all just call the only
possible target cache_cpu_init() directly. Remove the set_all function
pointer from struct mtrr_ops.
Signed-off-by: Juergen Gross <[email protected]>
Signed-off-by: Borislav Petkov <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Borislav Petkov <[email protected]>
|
|
Add a main cache_cpu_init() init routine which initializes MTRR and/or
PAT support depending on what has been detected on the system.
Leave the MTRR-specific initialization in a MTRR-specific init function
where the smp_changes_mask setting happens now with caches disabled.
This global mask update was done with caches enabled before probably
because atomic operations while running uncached might have been quite
expensive.
But since only systems with a broken BIOS should ever require to set any
bit in smp_changes_mask, hurting those devices with a penalty of a few
microseconds during boot shouldn't be a real issue.
Signed-off-by: Juergen Gross <[email protected]>
Signed-off-by: Borislav Petkov <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Borislav Petkov <[email protected]>
|
|
Prepare making PAT and MTRR support independent from each other by
moving some code needed by both out of the MTRR-specific sources.
[ bp: Massage commit message. ]
Signed-off-by: Juergen Gross <[email protected]>
Signed-off-by: Borislav Petkov <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Borislav Petkov <[email protected]>
|
|
Split the MTRR-specific actions from cache_disable() and cache_enable()
into new functions mtrr_disable() and mtrr_enable().
Signed-off-by: Juergen Gross <[email protected]>
Signed-off-by: Borislav Petkov <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Borislav Petkov <[email protected]>
|
|
Rename the currently MTRR-specific functions prepare_set() and
post_set() in preparation to move them. Make them non-static and put
their prototypes into cacheinfo.h, where they will end after moving them
to their final position anyway.
Expand the comment before the functions with an introductory line and
rename two related static variables, too.
[ bp: Massage commit message. ]
Signed-off-by: Juergen Gross <[email protected]>
Signed-off-by: Borislav Petkov <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Borislav Petkov <[email protected]>
|
|
In MTRR code use_intel() is only used in one source file, and the
relevant use_intel_if member of struct mtrr_ops is set only in
generic_mtrr_ops.
Replace use_intel() with a single flag in cacheinfo.c which can be
set when assigning generic_mtrr_ops to mtrr_if. This allows to drop
use_intel_if from mtrr_ops, while preparing to decouple PAT from MTRR.
As another preparation for the PAT/MTRR decoupling use a bit for MTRR
control and one for PAT control. For now set both bits together, this
can be changed later.
As the new flag will be set only if mtrr_enabled is set, the test for
mtrr_enabled can be dropped at some places.
[ bp: Massage commit message. ]
Signed-off-by: Juergen Gross <[email protected]>
Signed-off-by: Borislav Petkov <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Borislav Petkov <[email protected]>
|
|
virt/kvm/irqchip.c is including "irq.h" from the arch-specific KVM source
directory (i.e. not from arch/*/include) for the sole purpose of retrieving
irqchip_in_kernel.
Making the function inline in a header that is already included,
such as asm/kvm_host.h, is not possible because it needs to look at
struct kvm which is defined after asm/kvm_host.h is included. So add a
kvm_arch_irqchip_in_kernel non-inline function; irqchip_in_kernel() is
only performance critical on arm64 and x86, and the non-inline function
is enough on all other architectures.
irq.h can then be deleted from all architectures except x86.
Signed-off-by: Paolo Bonzini <[email protected]>
|
|
Defer reprogramming counters and handling overflow via KVM_REQ_PMU
when incrementing counters. KVM skips emulated WRMSR in the VM-Exit
fastpath, the fastpath runs with IRQs disabled, skipping instructions
can increment and reprogram counters, reprogramming counters can
sleep, and sleeping is disallowed while IRQs are disabled.
[*] BUG: sleeping function called from invalid context at kernel/locking/mutex.c:580
[*] in_atomic(): 1, irqs_disabled(): 1, non_block: 0, pid: 2981888, name: CPU 15/KVM
[*] preempt_count: 1, expected: 0
[*] RCU nest depth: 0, expected: 0
[*] INFO: lockdep is turned off.
[*] irq event stamp: 0
[*] hardirqs last enabled at (0): [<0000000000000000>] 0x0
[*] hardirqs last disabled at (0): [<ffffffff8121222a>] copy_process+0x146a/0x62d0
[*] softirqs last enabled at (0): [<ffffffff81212269>] copy_process+0x14a9/0x62d0
[*] softirqs last disabled at (0): [<0000000000000000>] 0x0
[*] Preemption disabled at:
[*] [<ffffffffc2063fc1>] vcpu_enter_guest+0x1001/0x3dc0 [kvm]
[*] CPU: 17 PID: 2981888 Comm: CPU 15/KVM Kdump: 5.19.0-rc1-g239111db364c-dirty #2
[*] Call Trace:
[*] <TASK>
[*] dump_stack_lvl+0x6c/0x9b
[*] __might_resched.cold+0x22e/0x297
[*] __mutex_lock+0xc0/0x23b0
[*] perf_event_ctx_lock_nested+0x18f/0x340
[*] perf_event_pause+0x1a/0x110
[*] reprogram_counter+0x2af/0x1490 [kvm]
[*] kvm_pmu_trigger_event+0x429/0x950 [kvm]
[*] kvm_skip_emulated_instruction+0x48/0x90 [kvm]
[*] handle_fastpath_set_msr_irqoff+0x349/0x3b0 [kvm]
[*] vmx_vcpu_run+0x268e/0x3b80 [kvm_intel]
[*] vcpu_enter_guest+0x1d22/0x3dc0 [kvm]
Add a field to kvm_pmc to track the previous counter value in order
to defer overflow detection to kvm_pmu_handle_event() (the counter must
be paused before handling overflow, and that may increment the counter).
Opportunistically shrink sizeof(struct kvm_pmc) a bit.
Suggested-by: Wanpeng Li <[email protected]>
Fixes: 9cd803d496e7 ("KVM: x86: Update vPMCs when retiring instructions")
Signed-off-by: Like Xu <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
[sean: avoid re-triggering KVM_REQ_PMU on overflow, tweak changelog]
Signed-off-by: Sean Christopherson <[email protected]>
Message-Id: <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
|
|
Batch reprogramming PMU counters by setting KVM_REQ_PMU and thus
deferring reprogramming kvm_pmu_handle_event() to avoid reprogramming
a counter multiple times during a single VM-Exit.
Deferring programming will also allow KVM to fix a bug where immediately
reprogramming a counter can result in sleeping (taking a mutex) while
interrupts are disabled in the VM-Exit fastpath.
Introduce kvm_pmu_request_counter_reprogam() to make it obvious that
KVM is _requesting_ a reprogram and not actually doing the reprogram.
Opportunistically refine related comments to avoid misunderstandings.
Signed-off-by: Like Xu <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Sean Christopherson <[email protected]>
Message-Id: <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
|
|
When reprogramming a counter, clear the counter's "reprogram pending" bit
if the counter is disabled (by the guest) or is disallowed (by the
userspace filter). In both cases, there's no need to re-attempt
programming on the next coincident KVM_REQ_PMU as enabling the counter by
either method will trigger reprogramming.
Signed-off-by: Sean Christopherson <[email protected]>
Message-Id: <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
|
|
Force vCPUs to reprogram all counters on a PMU filter change to provide
a sane ABI for userspace. Use the existing KVM_REQ_PMU to do the
programming, and take advantage of the fact that the reprogram_pmi bitmap
fits in a u64 to set all bits in a single atomic update. Note, setting
the bitmap and making the request needs to be done _after_ the SRCU
synchronization to ensure that vCPUs will reprogram using the new filter.
KVM's current "lazy" approach is confusing and non-deterministic. It's
confusing because, from a developer perspective, the code is buggy as it
makes zero sense to let userspace modify the filter but then not actually
enforce the new filter. The lazy approach is non-deterministic because
KVM enforces the filter whenever a counter is reprogrammed, not just on
guest WRMSRs, i.e. a guest might gain/lose access to an event at random
times depending on what is going on in the host.
Note, the resulting behavior is still non-determinstic while the filter
is in flux. If userspace wants to guarantee deterministic behavior, all
vCPUs should be paused during the filter update.
Jim Mattson <[email protected]>
Fixes: 66bb8a065f5a ("KVM: x86: PMU Event Filter")
Cc: Aaron Lewis <[email protected]>
Signed-off-by: Sean Christopherson <[email protected]>
Message-Id: <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
|
|
Extend the accounting sanity check in kvm_recover_nx_huge_pages() to the
TDP MMU, i.e. verify that zapping a shadow page unaccounts the disallowed
NX huge page regardless of the MMU type. Recovery runs while holding
mmu_lock for write and so it should be impossible to get false positives
on the WARN.
Suggested-by: Yan Zhao <[email protected]>
Signed-off-by: Sean Christopherson <[email protected]>
Message-Id: <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
|
|
Explicitly check if a NX huge page is disallowed when determining if a
page fault needs to be forced to use a smaller sized page. KVM currently
assumes that the NX huge page mitigation is the only scenario where KVM
will force a shadow page instead of a huge page, and so unnecessarily
keeps an existing shadow page instead of replacing it with a huge page.
Any scenario that causes KVM to zap leaf SPTEs may result in having a SP
that can be made huge without violating the NX huge page mitigation.
E.g. prior to commit 5ba7c4c6d1c7 ("KVM: x86/MMU: Zap non-leaf SPTEs when
disabling dirty logging"), KVM would keep shadow pages after disabling
dirty logging due to a live migration being canceled, resulting in
degraded performance due to running with 4kb pages instead of huge pages.
Although the dirty logging case is "fixed", that fix is coincidental,
i.e. is an implementation detail, and there are other scenarios where KVM
will zap leaf SPTEs. E.g. zapping leaf SPTEs in response to a host page
migration (mmu_notifier invalidation) to create a huge page would yield a
similar result; KVM would see the shadow-present non-leaf SPTE and assume
a huge page is disallowed.
Fixes: b8e8c8303ff2 ("kvm: mmu: ITLB_MULTIHIT mitigation")
Reviewed-by: Ben Gardon <[email protected]>
Reviewed-by: David Matlack <[email protected]>
Signed-off-by: Mingwei Zhang <[email protected]>
[sean: use spte_to_child_sp(), massage changelog, fold into if-statement]
Signed-off-by: Sean Christopherson <[email protected]>
Reviewed-by: Yan Zhao <[email protected]>
Message-Id: <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
|
|
Add a helper to convert a SPTE to its shadow page to deduplicate a
variety of flows and hopefully avoid future bugs, e.g. if KVM attempts to
get the shadow page for a SPTE without dropping high bits.
Opportunistically add a comment in mmu_free_root_page() documenting why
it treats the root HPA as a SPTE.
No functional change intended.
Signed-off-by: Sean Christopherson <[email protected]>
Message-Id: <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
|
|
Track the number of TDP MMU "shadow" pages instead of tracking the pages
themselves. With the NX huge page list manipulation moved out of the common
linking flow, elminating the list-based tracking means the happy path of
adding a shadow page doesn't need to acquire a spinlock and can instead
inc/dec an atomic.
Keep the tracking as the WARN during TDP MMU teardown on leaked shadow
pages is very, very useful for detecting KVM bugs.
Tracking the number of pages will also make it trivial to expose the
counter to userspace as a stat in the future, which may or may not be
desirable.
Note, the TDP MMU needs to use a separate counter (and stat if that ever
comes to be) from the existing n_used_mmu_pages. The TDP MMU doesn't bother
supporting the shrinker nor does it honor KVM_SET_NR_MMU_PAGES (because the
TDP MMU consumes so few pages relative to shadow paging), and including TDP
MMU pages in that counter would break both the shrinker and shadow MMUs,
e.g. if a VM is using nested TDP.
Cc: Yan Zhao <[email protected]>
Reviewed-by: Mingwei Zhang <[email protected]>
Reviewed-by: David Matlack <[email protected]>
Signed-off-by: Sean Christopherson <[email protected]>
Reviewed-by: Yan Zhao <[email protected]>
Message-Id: <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
|
|
Set nx_huge_page_disallowed in TDP MMU shadow pages before making the SP
visible to other readers, i.e. before setting its SPTE. This will allow
KVM to query the flag when determining if a shadow page can be replaced
by a NX huge page without violating the rules of the mitigation.
Note, the shadow/legacy MMU holds mmu_lock for write, so it's impossible
for another CPU to see a shadow page without an up-to-date
nx_huge_page_disallowed, i.e. only the TDP MMU needs the complicated
dance.
Signed-off-by: Sean Christopherson <[email protected]>
Reviewed-by: David Matlack <[email protected]>
Reviewed-by: Yan Zhao <[email protected]>
Message-Id: <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
|
|
Account and track NX huge pages for nonpaging MMUs so that a future
enhancement to precisely check if a shadow page can't be replaced by a NX
huge page doesn't get false positives. Without correct tracking, KVM can
get stuck in a loop if an instruction is fetching and writing data on the
same huge page, e.g. KVM installs a small executable page on the fetch
fault, replaces it with an NX huge page on the write fault, and faults
again on the fetch.
Alternatively, and perhaps ideally, KVM would simply not enforce the
workaround for nonpaging MMUs. The guest has no page tables to abuse
and KVM is guaranteed to switch to a different MMU on CR0.PG being
toggled so there's no security or performance concerns. However, getting
make_spte() to play nice now and in the future is unnecessarily complex.
In the current code base, make_spte() can enforce the mitigation if TDP
is enabled or the MMU is indirect, but make_spte() may not always have a
vCPU/MMU to work with, e.g. if KVM were to support in-line huge page
promotion when disabling dirty logging.
Without a vCPU/MMU, KVM could either pass in the correct information
and/or derive it from the shadow page, but the former is ugly and the
latter subtly non-trivial due to the possibility of direct shadow pages
in indirect MMUs. Given that using shadow paging with an unpaged guest
is far from top priority _and_ has been subjected to the workaround since
its inception, keep it simple and just fix the accounting glitch.
Signed-off-by: Sean Christopherson <[email protected]>
Reviewed-by: David Matlack <[email protected]>
Reviewed-by: Mingwei Zhang <[email protected]>
Message-Id: <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
|
|
Rename most of the variables/functions involved in the NX huge page
mitigation to provide consistency, e.g. lpage vs huge page, and NX huge
vs huge NX, and also to provide clarity, e.g. to make it obvious the flag
applies only to the NX huge page mitigation, not to any condition that
prevents creating a huge page.
Add a comment explaining what the newly named "possible_nx_huge_pages"
tracks.
Leave the nx_lpage_splits stat alone as the name is ABI and thus set in
stone.
Signed-off-by: Sean Christopherson <[email protected]>
Reviewed-by: Mingwei Zhang <[email protected]>
Message-Id: <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
|
|
Tag shadow pages that cannot be replaced with an NX huge page regardless
of whether or not zapping the page would allow KVM to immediately create
a huge page, e.g. because something else prevents creating a huge page.
I.e. track pages that are disallowed from being NX huge pages regardless
of whether or not the page could have been huge at the time of fault.
KVM currently tracks pages that were disallowed from being huge due to
the NX workaround if and only if the page could otherwise be huge. But
that fails to handled the scenario where whatever restriction prevented
KVM from installing a huge page goes away, e.g. if dirty logging is
disabled, the host mapping level changes, etc...
Failure to tag shadow pages appropriately could theoretically lead to
false negatives, e.g. if a fetch fault requests a small page and thus
isn't tracked, and a read/write fault later requests a huge page, KVM
will not reject the huge page as it should.
To avoid yet another flag, initialize the list_head and use list_empty()
to determine whether or not a page is on the list of NX huge pages that
should be recovered.
Note, the TDP MMU accounting is still flawed as fixing the TDP MMU is
more involved due to mmu_lock being held for read. This will be
addressed in a future commit.
Fixes: 5bcaf3e1715f ("KVM: x86/mmu: Account NX huge page disallowed iff huge page was requested")
Signed-off-by: Sean Christopherson <[email protected]>
Message-Id: <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
|
|
Add the mask KVM_MSR_FILTER_RANGE_VALID_MASK for the flags in the
struct kvm_msr_filter_range. This simplifies checks that validate
these flags, and makes it easier to introduce new flags in the future.
No functional change intended.
Signed-off-by: Aaron Lewis <[email protected]>
Message-Id: <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
|
|
Add the mask KVM_MSR_FILTER_VALID_MASK for the flag in the struct
kvm_msr_filter. This makes it easier to introduce new flags in the
future.
No functional change intended.
Signed-off-by: Aaron Lewis <[email protected]>
Message-Id: <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
|
|
Add the mask KVM_MSR_EXIT_REASON_VALID_MASK for the MSR exit reason
flags. This simplifies checks that validate these flags, and makes it
easier to introduce new flags in the future.
No functional change intended.
Signed-off-by: Aaron Lewis <[email protected]>
Message-Id: <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
|
|
Protect the kernel from using the flag KVM_MSR_FILTER_DEFAULT_ALLOW.
Its value is 0, and using it incorrectly could have unintended
consequences. E.g. prevent someone in the kernel from writing something
like this.
if (filter.flags & KVM_MSR_FILTER_DEFAULT_ALLOW)
<allow the MSR>
and getting confused when it doesn't work.
It would be more ideal to remove this flag altogether, but userspace
may already be using it, so protecting the kernel is all that can
reasonably be done at this point.
Suggested-by: Sean Christopherson <[email protected]>
Signed-off-by: Aaron Lewis <[email protected]>
Reviewed-by: Sean Christopherson <[email protected]>
Message-Id: <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
|
|
Enable x86 slow page faults to be able to respond to non-fatal signals,
returning -EINTR properly when it happens.
Signed-off-by: Peter Xu <[email protected]>
Reviewed-by: Sean Christopherson <[email protected]>
Message-Id: <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
|
|
Add a new "interruptible" flag showing that the caller is willing to be
interrupted by signals during the __gfn_to_pfn_memslot() request. Wire it
up with a FOLL_INTERRUPTIBLE flag that we've just introduced.
This prepares KVM to be able to respond to SIGUSR1 (for QEMU that's the
SIGIPI) even during e.g. handling an userfaultfd page fault.
No functional change intended.
Signed-off-by: Peter Xu <[email protected]>
Reviewed-by: Sean Christopherson <[email protected]>
Message-Id: <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
|
|
When #SMI is asserted, the CPU can be in interrupt shadow due to sti or
mov ss.
It is not mandatory in Intel/AMD prm to have the #SMI blocked during the
shadow, and on top of that, since neither SVM nor VMX has true support
for SMI window, waiting for one instruction would mean single stepping
the guest.
Instead, allow #SMI in this case, but both reset the interrupt window and
stash its value in SMRAM to restore it on exit from SMM.
This fixes rare failures seen mostly on windows guests on VMX, when #SMI
falls on the sti instruction which mainfest in VM entry failure due
to EFLAGS.IF not being set, but STI interrupt window still being set
in the VMCS.
Signed-off-by: Maxim Levitsky <[email protected]>
Message-Id: <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
|
|
When the guest CPUID doesn't have support for long mode, 32 bit SMRAM
layout is used and it has no support for preserving EFER and/or SVM
state.
Note that this isn't relevant to running 32 bit guests on VM which is
long mode capable - such VM can still run 32 bit guests in compatibility
mode.
Signed-off-by: Maxim Levitsky <[email protected]>
Message-Id: <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
|
|
Use SMM structs in the SVM code as well, which removes the last user of
put_smstate/GET_SMSTATE so remove these macros as well.
Signed-off-by: Maxim Levitsky <[email protected]>
Message-Id: <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
|
|
if kvm_vcpu_map returns non zero value, error path should be triggered
regardless of the exact returned error value.
Suggested-by: Sean Christopherson <[email protected]>
Signed-off-by: Maxim Levitsky <[email protected]>
Message-Id: <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
|
|
Use kvm_smram_state_64 struct to save/restore the 64 bit SMM state
(used when X86_FEATURE_LM is present in the guest CPUID,
regardless of 32-bitness of the guest).
Signed-off-by: Maxim Levitsky <[email protected]>
Message-Id: <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
|
|
Use kvm_smram_state_32 struct to save/restore 32 bit SMM state
(used when X86_FEATURE_LM is not present in the guest CPUID).
Signed-off-by: Maxim Levitsky <[email protected]>
Message-Id: <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
|
|
Use kvm_smram union instad of raw arrays in the common smm code.
Signed-off-by: Maxim Levitsky <[email protected]>
Message-Id: <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
|