aboutsummaryrefslogtreecommitdiff
path: root/arch/x86
AgeCommit message (Collapse)AuthorFilesLines
2022-11-11Merge tag 'mm-hotfixes-stable-2022-11-11' of ↵Linus Torvalds3-0/+14
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull misc hotfixes from Andrew Morton: "22 hotfixes. Eight are cc:stable and the remainder address issues which were introduced post-6.0 or which aren't considered serious enough to justify a -stable backport" * tag 'mm-hotfixes-stable-2022-11-11' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (22 commits) docs: kmsan: fix formatting of "Example report" mm/damon/dbgfs: check if rm_contexts input is for a real context maple_tree: don't set a new maximum on the node when not reusing nodes maple_tree: fix depth tracking in maple_state arch/x86/mm/hugetlbpage.c: pud_huge() returns 0 when using 2-level paging fs: fix leaked psi pressure state nilfs2: fix use-after-free bug of ns_writer on remount x86/traps: avoid KMSAN bugs originating from handle_bug() kmsan: make sure PREEMPT_RT is off Kconfig.debug: ensure early check for KMSAN in CONFIG_KMSAN_WARN x86/uaccess: instrument copy_from_user_nmi() kmsan: core: kmsan_in_runtime() should return true in NMI context mm: hugetlb_vmemmap: include missing linux/moduleparam.h mm/shmem: use page_mapping() to detect page cache for uffd continue mm/memremap.c: map FS_DAX device memory as decrypted Partly revert "mm/thp: carry over dirty bit when thp splits on pmd" nilfs2: fix deadlock in nilfs_count_free_blocks() mm/mmap: fix memory leak in mmap_region() hugetlbfs: don't delete error page from pagecache maple_tree: reorganize testing to restore module testing ...
2022-11-11x86/hyperv: Restore VP assist page after cpu offlining/onliningVitaly Kuznetsov1-28/+26
Commit e5d9b714fe40 ("x86/hyperv: fix root partition faults when writing to VP assist page MSR") moved 'wrmsrl(HV_X64_MSR_VP_ASSIST_PAGE)' under 'if (*hvp)' condition. This works for root partition as hv_cpu_die() does memunmap() and sets 'hv_vp_assist_page[cpu]' to NULL but breaks non-root partitions as hv_cpu_die() doesn't free 'hv_vp_assist_page[cpu]' for them. This causes VP assist page to remain unset after CPU offline/online cycle: $ rdmsr -p 24 0x40000073 10212f001 $ echo 0 > /sys/devices/system/cpu/cpu24/online $ echo 1 > /sys/devices/system/cpu/cpu24/online $ rdmsr -p 24 0x40000073 0 Fix the issue by always writing to HV_X64_MSR_VP_ASSIST_PAGE in hv_cpu_init(). Note, checking 'if (!*hvp)', for root partition is pointless as hv_cpu_die() always sets 'hv_vp_assist_page[cpu]' to NULL (and it's also NULL initially). Note: the fact that 'hv_vp_assist_page[cpu]' is reset to NULL may present a (potential) issue for KVM. While Hyper-V uses CPUHP_AP_ONLINE_DYN stage in CPU hotplug, KVM uses CPUHP_AP_KVM_STARTING which comes earlier in CPU teardown sequence. It is theoretically possible that Enlightened VMCS is still in use. It is unclear if the issue is real and if using KVM with Hyper-V root partition is even possible. While on it, drop the unneeded smp_processor_id() call from hv_cpu_init(). Fixes: e5d9b714fe40 ("x86/hyperv: fix root partition faults when writing to VP assist page MSR") Signed-off-by: Vitaly Kuznetsov <[email protected]> Reviewed-by: Michael Kelley <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Wei Liu <[email protected]>
2022-11-11x86/acpi/cstate: Optimize ARB_DISABLE on Centaur CPUsTony W Wang-oc1-8/+16
On all recent Centaur platforms, ARB_DISABLE is handled by PMU automatically while entering C3 type state. No need for OS to issue the ARB_DISABLE, so set bm_control to zero to indicate that. Signed-off-by: Tony W Wang-oc <[email protected]> Signed-off-by: Dave Hansen <[email protected]> Acked-by: Rafael J. Wysocki <[email protected]> Link: https://lore.kernel.org/all/1667792089-4904-1-git-send-email-TonyWWang-oc%40zhaoxin.com
2022-11-11Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvmLinus Torvalds18-181/+331
Pull kvm "This is a pretty large diffstat for this time of the release. The main culprit is a reorganization of the AMD assembly trampoline, allowing percpu variables to be accessed early. This is needed for the return stack depth tracking retbleed mitigation that will be in 6.2, but it also makes it possible to tighten the IBRS restore on vmexit. The latter change is a long tail of the spectrev2/retbleed patches (the corresponding Intel change was simpler and went in already last June), which is why I am including it right now instead of sharing a topic branch with tip. Being assembly and being rich in comments makes the line count balloon a bit, but I am pretty confident in the change (famous last words) because the reorganization actually makes everything simpler and more understandable than before. It has also had external review and has been tested on the aforementioned 6.2 changes, which explode quite brutally without the fix. Apart from this, things are pretty normal. s390: - PCI fix - PV clock fix x86: - Fix clash between PMU MSRs and other MSRs - Prepare SVM assembly trampoline for 6.2 retbleed mitigation and for... - ... tightening IBRS restore on vmexit, moving it before the first RET or indirect branch - Fix log level for VMSA dump - Block all page faults during kvm_zap_gfn_range() Tools: - kvm_stat: fix incorrect detection of debugfs - kvm_stat: update vmexit definitions" * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: KVM: x86/mmu: Block all page faults during kvm_zap_gfn_range() KVM: x86/pmu: Limit the maximum number of supported AMD GP counters KVM: x86/pmu: Limit the maximum number of supported Intel GP counters KVM: x86/pmu: Do not speculatively query Intel GP PMCs that don't exist yet KVM: SVM: Only dump VMSA to klog at KERN_DEBUG level tools/kvm_stat: update exit reasons for vmx/svm/aarch64/userspace tools/kvm_stat: fix incorrect detection of debugfs x86, KVM: remove unnecessary argument to x86_virt_spec_ctrl and callers KVM: SVM: move MSR_IA32_SPEC_CTRL save/restore to assembly KVM: SVM: restore host save area from assembly KVM: SVM: move guest vmsave/vmload back to assembly KVM: SVM: do not allocate struct svm_cpu_data dynamically KVM: SVM: remove dead field from struct svm_cpu_data KVM: SVM: remove unused field from struct vcpu_svm KVM: SVM: retrieve VMCB from assembly KVM: SVM: adjust register allocation for __svm_vcpu_run() KVM: SVM: replace regs argument of __svm_vcpu_run() with vcpu_svm KVM: x86: use a separate asm-offsets.c file KVM: s390: pci: Fix allocation size of aift kzdev elements KVM: s390: pv: don't allow userspace to set the clock under PV
2022-11-11Merge tag 'hyperv-fixes-signed-20221110' of ↵Linus Torvalds1-9/+10
git://git.kernel.org/pub/scm/linux/kernel/git/hyperv/linux Pull hyperv fixes from Wei Liu: - Fix TSC MSR write for root partition (Anirudh Rayabharam) - Fix definition of vector in pci-hyperv driver (Dexuan Cui) - A few other misc patches * tag 'hyperv-fixes-signed-20221110' of git://git.kernel.org/pub/scm/linux/kernel/git/hyperv/linux: PCI: hv: Fix the definition of vector in hv_compose_msi_msg() MAINTAINERS: remove sthemmin x86/hyperv: fix invalid writes to MSRs during root partition kexec clocksource/drivers/hyperv: add data structure for reference TSC MSR Drivers: hv: fix repeated words in comments x86/hyperv: Remove BUG_ON() for kmap_local_page()
2022-11-11KVM: x86/mmu: Block all page faults during kvm_zap_gfn_range()Sean Christopherson1-2/+2
When zapping a GFN range, pass 0 => ALL_ONES for the to-be-invalidated range to effectively block all page faults while the zap is in-progress. The invalidation helpers take a host virtual address, whereas zapping a GFN obviously provides a guest physical address and with the wrong unit of measurement (frame vs. byte). Alternatively, KVM could walk all memslots to get the associated HVAs, but thanks to SMM, that would require multiple lookups. And practically speaking, kvm_zap_gfn_range() usage is quite rare and not a hot path, e.g. MTRR and CR0.CD are almost guaranteed to be done only on vCPU0 during boot, and APICv inhibits are similarly infrequent operations. Fixes: edb298c663fc ("KVM: x86/mmu: bump mmu notifier count in kvm_zap_gfn_range") Reported-by: Chao Peng <[email protected]> Cc: [email protected] Cc: Maxim Levitsky <[email protected]> Signed-off-by: Sean Christopherson <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2022-11-10Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski12-60/+70
drivers/net/can/pch_can.c ae64438be192 ("can: dev: fix skb drop check") 1dd1b521be85 ("can: remove obsolete PCH CAN driver") https://lore.kernel.org/all/[email protected]/ Signed-off-by: Jakub Kicinski <[email protected]>
2022-11-10x86/split_lock: Add sysctl to control the misery modeGuilherme G. Piccoli1-10/+53
Commit b041b525dab9 ("x86/split_lock: Make life miserable for split lockers") changed the way the split lock detector works when in "warn" mode; basically, it not only shows the warn message, but also intentionally introduces a slowdown through sleeping plus serialization mechanism on such task. Based on discussions in [0], seems the warning alone wasn't enough motivation for userspace developers to fix their applications. This slowdown is enough to totally break some proprietary (aka. unfixable) userspace[1]. Happens that originally the proposal in [0] was to add a new mode which would warns + slowdown the "split locking" task, keeping the old warn mode untouched. In the end, that idea was discarded and the regular/default "warn" mode now slows down the applications. This is quite aggressive with regards proprietary/legacy programs that basically are unable to properly run in kernel with this change. While it is understandable that a malicious application could DoS by split locking, it seems unacceptable to regress old/proprietary userspace programs through a default configuration that previously worked. An example of such breakage was reported in [1]. Add a sysctl to allow controlling the "misery mode" behavior, as per Thomas suggestion on [2]. This way, users running legacy and/or proprietary software are allowed to still execute them with a decent performance while still observing the warning messages on kernel log. [0] https://lore.kernel.org/lkml/[email protected]/ [1] https://github.com/doitsujin/dxvk/issues/2938 [2] https://lore.kernel.org/lkml/87pmf4bter.ffs@tglx/ [ dhansen: minor changelog tweaks, including clarifying the actual problem ] Fixes: b041b525dab9 ("x86/split_lock: Make life miserable for split lockers") Suggested-by: Thomas Gleixner <[email protected]> Signed-off-by: Guilherme G. Piccoli <[email protected]> Signed-off-by: Dave Hansen <[email protected]> Reviewed-by: Tony Luck <[email protected]> Tested-by: Andre Almeida <[email protected]> Link: https://lore.kernel.org/all/20221024200254.635256-1-gpiccoli%40igalia.com
2022-11-10x86/fpu: Drop fpregs lock before inheriting FPU permissionsMel Gorman1-1/+1
Mike Galbraith reported the following against an old fork of preempt-rt but the same issue also applies to the current preempt-rt tree. BUG: sleeping function called from invalid context at kernel/locking/spinlock_rt.c:46 in_atomic(): 1, irqs_disabled(): 0, non_block: 0, pid: 1, name: systemd preempt_count: 1, expected: 0 RCU nest depth: 0, expected: 0 Preemption disabled at: fpu_clone CPU: 6 PID: 1 Comm: systemd Tainted: G E (unreleased) Call Trace: <TASK> dump_stack_lvl ? fpu_clone __might_resched rt_spin_lock fpu_clone ? copy_thread ? copy_process ? shmem_alloc_inode ? kmem_cache_alloc ? kernel_clone ? __do_sys_clone ? do_syscall_64 ? __x64_sys_rt_sigprocmask ? syscall_exit_to_user_mode ? do_syscall_64 ? syscall_exit_to_user_mode ? do_syscall_64 ? syscall_exit_to_user_mode ? do_syscall_64 ? exc_page_fault ? entry_SYSCALL_64_after_hwframe </TASK> Mike says: The splat comes from fpu_inherit_perms() being called under fpregs_lock(), and us reaching the spin_lock_irq() therein due to fpu_state_size_dynamic() returning true despite static key __fpu_state_size_dynamic having never been enabled. Mike's assessment looks correct. fpregs_lock on a PREEMPT_RT kernel disables preemption so calling spin_lock_irq() in fpu_inherit_perms() is unsafe. This problem exists since commit 9e798e9aa14c ("x86/fpu: Prepare fpu_clone() for dynamically enabled features"). Even though the original bug report should not have enabled the paths at all, the bug still exists. fpregs_lock is necessary when editing the FPU registers or a task's FP state but it is not necessary for fpu_inherit_perms(). The only write of any FP state in fpu_inherit_perms() is for the new child which is not running yet and cannot context switch or be borrowed by a kernel thread yet. Hence, fpregs_lock is not protecting anything in the new child until clone() completes and can be dropped earlier. The siglock still needs to be acquired by fpu_inherit_perms() as the read of the parent's permissions has to be serialised. [ bp: Cleanup splat. ] Fixes: 9e798e9aa14c ("x86/fpu: Prepare fpu_clone() for dynamically enabled features") Reported-by: Mike Galbraith <[email protected]> Signed-off-by: Mel Gorman <[email protected]> Signed-off-by: Borislav Petkov <[email protected]> Reviewed-by: Thomas Gleixner <[email protected]> Cc: <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2022-11-10KVM: Move declaration of kvm_cpu_dirty_log_size() to kvm_dirty_ring.hGavin Shan1-2/+0
Not all architectures like ARM64 need to override the function. Move its declaration to kvm_dirty_ring.h to avoid the following compiling warning on ARM64 when the feature is enabled. arch/arm64/kvm/../../../virt/kvm/dirty_ring.c:14:12: \ warning: no previous prototype for 'kvm_cpu_dirty_log_size' \ [-Wmissing-prototypes] \ int __weak kvm_cpu_dirty_log_size(void) Reported-by: kernel test robot <[email protected]> Signed-off-by: Gavin Shan <[email protected]> Reviewed-by: Peter Xu <[email protected]> Signed-off-by: Marc Zyngier <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2022-11-10KVM: x86: Introduce KVM_REQ_DIRTY_RING_SOFT_FULLGavin Shan1-9/+6
The VCPU isn't expected to be runnable when the dirty ring becomes soft full, until the dirty pages are harvested and the dirty ring is reset from userspace. So there is a check in each guest's entrace to see if the dirty ring is soft full or not. The VCPU is stopped from running if its dirty ring has been soft full. The similar check will be needed when the feature is going to be supported on ARM64. As Marc Zyngier suggested, a new event will avoid pointless overhead to check the size of the dirty ring ('vcpu->kvm->dirty_ring_size') in each guest's entrance. Add KVM_REQ_DIRTY_RING_SOFT_FULL. The event is raised when the dirty ring becomes soft full in kvm_dirty_ring_push(). The event is only cleared in the check, done in the newly added helper kvm_dirty_ring_check_request(). Since the VCPU is not runnable when the dirty ring becomes soft full, the KVM_REQ_DIRTY_RING_SOFT_FULL event is always set to prevent the VCPU from running until the dirty pages are harvested and the dirty ring is reset by userspace. kvm_dirty_ring_soft_full() becomes a private function with the newly added helper kvm_dirty_ring_check_request(). The alignment for the various event definitions in kvm_host.h is changed to tab character by the way. In order to avoid using 'container_of()', the argument @ring is replaced by @vcpu in kvm_dirty_ring_push(). Link: https://lore.kernel.org/kvmarm/[email protected] Suggested-by: Marc Zyngier <[email protected]> Signed-off-by: Gavin Shan <[email protected]> Reviewed-by: Peter Xu <[email protected]> Reviewed-by: Sean Christopherson <[email protected]> Signed-off-by: Marc Zyngier <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2022-11-10x86/mtrr: Simplify mtrr_ops initializationJuergen Gross5-54/+10
The way mtrr_if is initialized with the correct mtrr_ops structure is quite weird. Simplify that by dropping the vendor specific init functions and the mtrr_ops[] array. Replace those with direct assignments of the related vendor specific ops array to mtrr_if. Note that a direct assignment is okay even for 64-bit builds, where the symbol isn't present, as the related code will be subject to "dead code elimination" due to how cpu_feature_enabled() is implemented. Signed-off-by: Juergen Gross <[email protected]> Signed-off-by: Borislav Petkov <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Borislav Petkov <[email protected]>
2022-11-10x86/cacheinfo: Switch cache_ap_init() to hotplug callbackJuergen Gross4-7/+15
Instead of explicitly calling cache_ap_init() in identify_secondary_cpu() use a CPU hotplug callback instead. By registering the callback only after having started the non-boot CPUs and initializing cache_aps_delayed_init with "true", calling set_cache_aps_delayed_init() at boot time can be dropped. It should be noted that this change results in cache_ap_init() being called a little bit later when hotplugging CPUs. By using a new hotplug slot right at the start of the low level bringup this is not problematic, as no operations requiring a specific caching mode are performed that early in CPU initialization. Suggested-by: Borislav Petkov <[email protected]> Signed-off-by: Juergen Gross <[email protected]> Signed-off-by: Borislav Petkov <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Borislav Petkov <[email protected]>
2022-11-10x86: Decouple PAT and MTRR handlingJuergen Gross5-128/+57
Today, PAT is usable only with MTRR being active, with some nasty tweaks to make PAT usable when running as a Xen PV guest which doesn't support MTRR. The reason for this coupling is that both PAT MSR changes and MTRR changes require a similar sequence and so full PAT support was added using the already available MTRR handling. Xen PV PAT handling can work without MTRR, as it just needs to consume the PAT MSR setting done by the hypervisor without the ability and need to change it. This in turn has resulted in a convoluted initialization sequence and wrong decisions regarding cache mode availability due to misguiding PAT availability flags. Fix all of that by allowing to use PAT without MTRR and by reworking the current PAT initialization sequence to match better with the newly introduced generic cache initialization. This removes the need of the recently added pat_force_disabled flag, so remove the remnants of the patch adding it. Signed-off-by: Juergen Gross <[email protected]> Signed-off-by: Borislav Petkov <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Borislav Petkov <[email protected]>
2022-11-10x86/mtrr: Add a stop_machine() handler calling only cache_cpu_init()Juergen Gross8-99/+74
Instead of having a stop_machine() handler for either a specific MTRR register or all state at once, add a handler just for calling cache_cpu_init() if appropriate. Add functions for calling stop_machine() with this handler as well. Add a generic replacement for mtrr_bp_restore() and a wrapper for mtrr_bp_init(). Signed-off-by: Juergen Gross <[email protected]> Signed-off-by: Borislav Petkov <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Borislav Petkov <[email protected]>
2022-11-10x86/mtrr: Let cache_aps_delayed_init replace mtrr_aps_delayed_initJuergen Gross5-17/+22
In order to prepare decoupling MTRR and PAT replace the MTRR-specific mtrr_aps_delayed_init flag with a more generic cache_aps_delayed_init one. Signed-off-by: Juergen Gross <[email protected]> Signed-off-by: Borislav Petkov <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Borislav Petkov <[email protected]>
2022-11-10x86/mtrr: Get rid of __mtrr_enabled boolJuergen Gross1-8/+5
There is no need for keeping __mtrr_enabled as it can easily be replaced by testing mtrr_if to be not NULL. Signed-off-by: Juergen Gross <[email protected]> Signed-off-by: Borislav Petkov <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Borislav Petkov <[email protected]>
2022-11-10x86/mtrr: Simplify mtrr_bp_init()Juergen Gross3-20/+1
In case of the generic cache interface being used (Intel CPUs or a 64-bit system), the initialization sequence of the boot CPU is more complicated than necessary: - check if MTRR enabled, if yes, call mtrr_bp_pat_init() which will disable caching, set the PAT MSR, and reenable caching - call mtrr_cleanup(), in case that changed anything, call cache_cpu_init() doing the same caching disable/enable dance as above, but this time with setting the (modified) MTRR state (even if MTRR was disabled) AND setting the PAT MSR (again even with disabled MTRR) The sequence can be simplified a lot while removing potential inconsistencies: - check if MTRR enabled, if yes, call mtrr_cleanup() and then cache_cpu_init() This ensures to: - no longer disable/enable caching more than once - avoid to set MTRRs and/or the PAT MSR on the boot processor in case of MTRR cleanups even if MTRRs meant to be disabled With that mtrr_bp_pat_init() can be removed. Signed-off-by: Juergen Gross <[email protected]> Signed-off-by: Borislav Petkov <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Borislav Petkov <[email protected]>
2022-11-10x86/mtrr: Remove set_all callback from struct mtrr_opsJuergen Gross3-8/+5
Instead of using an indirect call to mtrr_if->set_all just call the only possible target cache_cpu_init() directly. Remove the set_all function pointer from struct mtrr_ops. Signed-off-by: Juergen Gross <[email protected]> Signed-off-by: Borislav Petkov <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Borislav Petkov <[email protected]>
2022-11-10x86/mtrr: Disentangle MTRR init from PAT initJuergen Gross4-13/+22
Add a main cache_cpu_init() init routine which initializes MTRR and/or PAT support depending on what has been detected on the system. Leave the MTRR-specific initialization in a MTRR-specific init function where the smp_changes_mask setting happens now with caches disabled. This global mask update was done with caches enabled before probably because atomic operations while running uncached might have been quite expensive. But since only systems with a broken BIOS should ever require to set any bit in smp_changes_mask, hurting those devices with a penalty of a few microseconds during boot shouldn't be a real issue. Signed-off-by: Juergen Gross <[email protected]> Signed-off-by: Borislav Petkov <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Borislav Petkov <[email protected]>
2022-11-10x86/mtrr: Move cache control code to cacheinfo.cJuergen Gross2-74/+77
Prepare making PAT and MTRR support independent from each other by moving some code needed by both out of the MTRR-specific sources. [ bp: Massage commit message. ] Signed-off-by: Juergen Gross <[email protected]> Signed-off-by: Borislav Petkov <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Borislav Petkov <[email protected]>
2022-11-10x86/mtrr: Split MTRR-specific handling from cache dis/enablingJuergen Gross2-7/+23
Split the MTRR-specific actions from cache_disable() and cache_enable() into new functions mtrr_disable() and mtrr_enable(). Signed-off-by: Juergen Gross <[email protected]> Signed-off-by: Borislav Petkov <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Borislav Petkov <[email protected]>
2022-11-10x86/mtrr: Rename prepare_set() and post_set()Juergen Gross2-22/+24
Rename the currently MTRR-specific functions prepare_set() and post_set() in preparation to move them. Make them non-static and put their prototypes into cacheinfo.h, where they will end after moving them to their final position anyway. Expand the comment before the functions with an introductory line and rename two related static variables, too. [ bp: Massage commit message. ] Signed-off-by: Juergen Gross <[email protected]> Signed-off-by: Borislav Petkov <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Borislav Petkov <[email protected]>
2022-11-10x86/mtrr: Replace use_intel() with a local flagJuergen Gross5-18/+21
In MTRR code use_intel() is only used in one source file, and the relevant use_intel_if member of struct mtrr_ops is set only in generic_mtrr_ops. Replace use_intel() with a single flag in cacheinfo.c which can be set when assigning generic_mtrr_ops to mtrr_if. This allows to drop use_intel_if from mtrr_ops, while preparing to decouple PAT from MTRR. As another preparation for the PAT/MTRR decoupling use a bit for MTRR control and one for PAT control. For now set both bits together, this can be changed later. As the new flag will be set only if mtrr_enabled is set, the test for mtrr_enabled can be dropped at some places. [ bp: Massage commit message. ] Signed-off-by: Juergen Gross <[email protected]> Signed-off-by: Borislav Petkov <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Borislav Petkov <[email protected]>
2022-11-09KVM: replace direct irq.h inclusionPaolo Bonzini1-0/+5
virt/kvm/irqchip.c is including "irq.h" from the arch-specific KVM source directory (i.e. not from arch/*/include) for the sole purpose of retrieving irqchip_in_kernel. Making the function inline in a header that is already included, such as asm/kvm_host.h, is not possible because it needs to look at struct kvm which is defined after asm/kvm_host.h is included. So add a kvm_arch_irqchip_in_kernel non-inline function; irqchip_in_kernel() is only performance critical on arm64 and x86, and the non-inline function is enough on all other architectures. irq.h can then be deleted from all architectures except x86. Signed-off-by: Paolo Bonzini <[email protected]>
2022-11-09KVM: x86/pmu: Defer counter emulated overflow via pmc->prev_counterLike Xu4-21/+22
Defer reprogramming counters and handling overflow via KVM_REQ_PMU when incrementing counters. KVM skips emulated WRMSR in the VM-Exit fastpath, the fastpath runs with IRQs disabled, skipping instructions can increment and reprogram counters, reprogramming counters can sleep, and sleeping is disallowed while IRQs are disabled. [*] BUG: sleeping function called from invalid context at kernel/locking/mutex.c:580 [*] in_atomic(): 1, irqs_disabled(): 1, non_block: 0, pid: 2981888, name: CPU 15/KVM [*] preempt_count: 1, expected: 0 [*] RCU nest depth: 0, expected: 0 [*] INFO: lockdep is turned off. [*] irq event stamp: 0 [*] hardirqs last enabled at (0): [<0000000000000000>] 0x0 [*] hardirqs last disabled at (0): [<ffffffff8121222a>] copy_process+0x146a/0x62d0 [*] softirqs last enabled at (0): [<ffffffff81212269>] copy_process+0x14a9/0x62d0 [*] softirqs last disabled at (0): [<0000000000000000>] 0x0 [*] Preemption disabled at: [*] [<ffffffffc2063fc1>] vcpu_enter_guest+0x1001/0x3dc0 [kvm] [*] CPU: 17 PID: 2981888 Comm: CPU 15/KVM Kdump: 5.19.0-rc1-g239111db364c-dirty #2 [*] Call Trace: [*] <TASK> [*] dump_stack_lvl+0x6c/0x9b [*] __might_resched.cold+0x22e/0x297 [*] __mutex_lock+0xc0/0x23b0 [*] perf_event_ctx_lock_nested+0x18f/0x340 [*] perf_event_pause+0x1a/0x110 [*] reprogram_counter+0x2af/0x1490 [kvm] [*] kvm_pmu_trigger_event+0x429/0x950 [kvm] [*] kvm_skip_emulated_instruction+0x48/0x90 [kvm] [*] handle_fastpath_set_msr_irqoff+0x349/0x3b0 [kvm] [*] vmx_vcpu_run+0x268e/0x3b80 [kvm_intel] [*] vcpu_enter_guest+0x1d22/0x3dc0 [kvm] Add a field to kvm_pmc to track the previous counter value in order to defer overflow detection to kvm_pmu_handle_event() (the counter must be paused before handling overflow, and that may increment the counter). Opportunistically shrink sizeof(struct kvm_pmc) a bit. Suggested-by: Wanpeng Li <[email protected]> Fixes: 9cd803d496e7 ("KVM: x86: Update vPMCs when retiring instructions") Signed-off-by: Like Xu <[email protected]> Link: https://lore.kernel.org/r/[email protected] [sean: avoid re-triggering KVM_REQ_PMU on overflow, tweak changelog] Signed-off-by: Sean Christopherson <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2022-11-09KVM: x86/pmu: Defer reprogram_counter() to kvm_pmu_handle_event()Like Xu5-10/+22
Batch reprogramming PMU counters by setting KVM_REQ_PMU and thus deferring reprogramming kvm_pmu_handle_event() to avoid reprogramming a counter multiple times during a single VM-Exit. Deferring programming will also allow KVM to fix a bug where immediately reprogramming a counter can result in sleeping (taking a mutex) while interrupts are disabled in the VM-Exit fastpath. Introduce kvm_pmu_request_counter_reprogam() to make it obvious that KVM is _requesting_ a reprogram and not actually doing the reprogram. Opportunistically refine related comments to avoid misunderstandings. Signed-off-by: Like Xu <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Sean Christopherson <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2022-11-09KVM: x86/pmu: Clear "reprogram" bit if counter is disabled or disallowedSean Christopherson1-14/+24
When reprogramming a counter, clear the counter's "reprogram pending" bit if the counter is disabled (by the guest) or is disallowed (by the userspace filter). In both cases, there's no need to re-attempt programming on the next coincident KVM_REQ_PMU as enabling the counter by either method will trigger reprogramming. Signed-off-by: Sean Christopherson <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2022-11-09KVM: x86/pmu: Force reprogramming of all counters on PMU filter changeSean Christopherson2-2/+22
Force vCPUs to reprogram all counters on a PMU filter change to provide a sane ABI for userspace. Use the existing KVM_REQ_PMU to do the programming, and take advantage of the fact that the reprogram_pmi bitmap fits in a u64 to set all bits in a single atomic update. Note, setting the bitmap and making the request needs to be done _after_ the SRCU synchronization to ensure that vCPUs will reprogram using the new filter. KVM's current "lazy" approach is confusing and non-deterministic. It's confusing because, from a developer perspective, the code is buggy as it makes zero sense to let userspace modify the filter but then not actually enforce the new filter. The lazy approach is non-deterministic because KVM enforces the filter whenever a counter is reprogrammed, not just on guest WRMSRs, i.e. a guest might gain/lose access to an event at random times depending on what is going on in the host. Note, the resulting behavior is still non-determinstic while the filter is in flux. If userspace wants to guarantee deterministic behavior, all vCPUs should be paused during the filter update. Jim Mattson <[email protected]> Fixes: 66bb8a065f5a ("KVM: x86: PMU Event Filter") Cc: Aaron Lewis <[email protected]> Signed-off-by: Sean Christopherson <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2022-11-09KVM: x86/mmu: WARN if TDP MMU SP disallows hugepage after being zappedSean Christopherson1-4/+3
Extend the accounting sanity check in kvm_recover_nx_huge_pages() to the TDP MMU, i.e. verify that zapping a shadow page unaccounts the disallowed NX huge page regardless of the MMU type. Recovery runs while holding mmu_lock for write and so it should be impossible to get false positives on the WARN. Suggested-by: Yan Zhao <[email protected]> Signed-off-by: Sean Christopherson <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2022-11-09KVM: x86/mmu: explicitly check nx_hugepage in disallowed_hugepage_adjust()Mingwei Zhang1-1/+2
Explicitly check if a NX huge page is disallowed when determining if a page fault needs to be forced to use a smaller sized page. KVM currently assumes that the NX huge page mitigation is the only scenario where KVM will force a shadow page instead of a huge page, and so unnecessarily keeps an existing shadow page instead of replacing it with a huge page. Any scenario that causes KVM to zap leaf SPTEs may result in having a SP that can be made huge without violating the NX huge page mitigation. E.g. prior to commit 5ba7c4c6d1c7 ("KVM: x86/MMU: Zap non-leaf SPTEs when disabling dirty logging"), KVM would keep shadow pages after disabling dirty logging due to a live migration being canceled, resulting in degraded performance due to running with 4kb pages instead of huge pages. Although the dirty logging case is "fixed", that fix is coincidental, i.e. is an implementation detail, and there are other scenarios where KVM will zap leaf SPTEs. E.g. zapping leaf SPTEs in response to a host page migration (mmu_notifier invalidation) to create a huge page would yield a similar result; KVM would see the shadow-present non-leaf SPTE and assume a huge page is disallowed. Fixes: b8e8c8303ff2 ("kvm: mmu: ITLB_MULTIHIT mitigation") Reviewed-by: Ben Gardon <[email protected]> Reviewed-by: David Matlack <[email protected]> Signed-off-by: Mingwei Zhang <[email protected]> [sean: use spte_to_child_sp(), massage changelog, fold into if-statement] Signed-off-by: Sean Christopherson <[email protected]> Reviewed-by: Yan Zhao <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2022-11-09KVM: x86/mmu: Add helper to convert SPTE value to its shadow pageSean Christopherson4-19/+29
Add a helper to convert a SPTE to its shadow page to deduplicate a variety of flows and hopefully avoid future bugs, e.g. if KVM attempts to get the shadow page for a SPTE without dropping high bits. Opportunistically add a comment in mmu_free_root_page() documenting why it treats the root HPA as a SPTE. No functional change intended. Signed-off-by: Sean Christopherson <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2022-11-09KVM: x86/mmu: Track the number of TDP MMU pages, but not the actual pagesSean Christopherson2-19/+12
Track the number of TDP MMU "shadow" pages instead of tracking the pages themselves. With the NX huge page list manipulation moved out of the common linking flow, elminating the list-based tracking means the happy path of adding a shadow page doesn't need to acquire a spinlock and can instead inc/dec an atomic. Keep the tracking as the WARN during TDP MMU teardown on leaked shadow pages is very, very useful for detecting KVM bugs. Tracking the number of pages will also make it trivial to expose the counter to userspace as a stat in the future, which may or may not be desirable. Note, the TDP MMU needs to use a separate counter (and stat if that ever comes to be) from the existing n_used_mmu_pages. The TDP MMU doesn't bother supporting the shrinker nor does it honor KVM_SET_NR_MMU_PAGES (because the TDP MMU consumes so few pages relative to shadow paging), and including TDP MMU pages in that counter would break both the shrinker and shadow MMUs, e.g. if a VM is using nested TDP. Cc: Yan Zhao <[email protected]> Reviewed-by: Mingwei Zhang <[email protected]> Reviewed-by: David Matlack <[email protected]> Signed-off-by: Sean Christopherson <[email protected]> Reviewed-by: Yan Zhao <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2022-11-09KVM: x86/mmu: Set disallowed_nx_huge_page in TDP MMU before setting SPTESean Christopherson3-25/+39
Set nx_huge_page_disallowed in TDP MMU shadow pages before making the SP visible to other readers, i.e. before setting its SPTE. This will allow KVM to query the flag when determining if a shadow page can be replaced by a NX huge page without violating the rules of the mitigation. Note, the shadow/legacy MMU holds mmu_lock for write, so it's impossible for another CPU to see a shadow page without an up-to-date nx_huge_page_disallowed, i.e. only the TDP MMU needs the complicated dance. Signed-off-by: Sean Christopherson <[email protected]> Reviewed-by: David Matlack <[email protected]> Reviewed-by: Yan Zhao <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2022-11-09KVM: x86/mmu: Properly account NX huge page workaround for nonpaging MMUsSean Christopherson2-1/+13
Account and track NX huge pages for nonpaging MMUs so that a future enhancement to precisely check if a shadow page can't be replaced by a NX huge page doesn't get false positives. Without correct tracking, KVM can get stuck in a loop if an instruction is fetching and writing data on the same huge page, e.g. KVM installs a small executable page on the fetch fault, replaces it with an NX huge page on the write fault, and faults again on the fetch. Alternatively, and perhaps ideally, KVM would simply not enforce the workaround for nonpaging MMUs. The guest has no page tables to abuse and KVM is guaranteed to switch to a different MMU on CR0.PG being toggled so there's no security or performance concerns. However, getting make_spte() to play nice now and in the future is unnecessarily complex. In the current code base, make_spte() can enforce the mitigation if TDP is enabled or the MMU is indirect, but make_spte() may not always have a vCPU/MMU to work with, e.g. if KVM were to support in-line huge page promotion when disabling dirty logging. Without a vCPU/MMU, KVM could either pass in the correct information and/or derive it from the shadow page, but the former is ugly and the latter subtly non-trivial due to the possibility of direct shadow pages in indirect MMUs. Given that using shadow paging with an unpaged guest is far from top priority _and_ has been subjected to the workaround since its inception, keep it simple and just fix the accounting glitch. Signed-off-by: Sean Christopherson <[email protected]> Reviewed-by: David Matlack <[email protected]> Reviewed-by: Mingwei Zhang <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2022-11-09KVM: x86/mmu: Rename NX huge pages fields/functions for consistencySean Christopherson5-51/+71
Rename most of the variables/functions involved in the NX huge page mitigation to provide consistency, e.g. lpage vs huge page, and NX huge vs huge NX, and also to provide clarity, e.g. to make it obvious the flag applies only to the NX huge page mitigation, not to any condition that prevents creating a huge page. Add a comment explaining what the newly named "possible_nx_huge_pages" tracks. Leave the nx_lpage_splits stat alone as the name is ABI and thus set in stone. Signed-off-by: Sean Christopherson <[email protected]> Reviewed-by: Mingwei Zhang <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2022-11-09KVM: x86/mmu: Tag disallowed NX huge pages even if they're not trackedSean Christopherson4-13/+39
Tag shadow pages that cannot be replaced with an NX huge page regardless of whether or not zapping the page would allow KVM to immediately create a huge page, e.g. because something else prevents creating a huge page. I.e. track pages that are disallowed from being NX huge pages regardless of whether or not the page could have been huge at the time of fault. KVM currently tracks pages that were disallowed from being huge due to the NX workaround if and only if the page could otherwise be huge. But that fails to handled the scenario where whatever restriction prevented KVM from installing a huge page goes away, e.g. if dirty logging is disabled, the host mapping level changes, etc... Failure to tag shadow pages appropriately could theoretically lead to false negatives, e.g. if a fetch fault requests a small page and thus isn't tracked, and a read/write fault later requests a huge page, KVM will not reject the huge page as it should. To avoid yet another flag, initialize the list_head and use list_empty() to determine whether or not a page is on the list of NX huge pages that should be recovered. Note, the TDP MMU accounting is still flawed as fixing the TDP MMU is more involved due to mmu_lock being held for read. This will be addressed in a future commit. Fixes: 5bcaf3e1715f ("KVM: x86/mmu: Account NX huge page disallowed iff huge page was requested") Signed-off-by: Sean Christopherson <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2022-11-09KVM: x86: Add a VALID_MASK for the flags in kvm_msr_filter_rangeAaron Lewis2-1/+3
Add the mask KVM_MSR_FILTER_RANGE_VALID_MASK for the flags in the struct kvm_msr_filter_range. This simplifies checks that validate these flags, and makes it easier to introduce new flags in the future. No functional change intended. Signed-off-by: Aaron Lewis <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2022-11-09KVM: x86: Add a VALID_MASK for the flag in kvm_msr_filterAaron Lewis2-1/+2
Add the mask KVM_MSR_FILTER_VALID_MASK for the flag in the struct kvm_msr_filter. This makes it easier to introduce new flags in the future. No functional change intended. Signed-off-by: Aaron Lewis <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2022-11-09KVM: x86: Add a VALID_MASK for the MSR exit reason flagsAaron Lewis1-3/+1
Add the mask KVM_MSR_EXIT_REASON_VALID_MASK for the MSR exit reason flags. This simplifies checks that validate these flags, and makes it easier to introduce new flags in the future. No functional change intended. Signed-off-by: Aaron Lewis <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2022-11-09KVM: x86: Disallow the use of KVM_MSR_FILTER_DEFAULT_ALLOW in the kernelAaron Lewis1-0/+2
Protect the kernel from using the flag KVM_MSR_FILTER_DEFAULT_ALLOW. Its value is 0, and using it incorrectly could have unintended consequences. E.g. prevent someone in the kernel from writing something like this. if (filter.flags & KVM_MSR_FILTER_DEFAULT_ALLOW) <allow the MSR> and getting confused when it doesn't work. It would be more ideal to remove this flag altogether, but userspace may already be using it, so protecting the kernel is all that can reasonably be done at this point. Suggested-by: Sean Christopherson <[email protected]> Signed-off-by: Aaron Lewis <[email protected]> Reviewed-by: Sean Christopherson <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2022-11-09kvm: x86: Allow to respond to generic signals during slow PFPeter Xu1-3/+13
Enable x86 slow page faults to be able to respond to non-fatal signals, returning -EINTR properly when it happens. Signed-off-by: Peter Xu <[email protected]> Reviewed-by: Sean Christopherson <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2022-11-09kvm: Add interruptible flag to __gfn_to_pfn_memslot()Peter Xu1-2/+2
Add a new "interruptible" flag showing that the caller is willing to be interrupted by signals during the __gfn_to_pfn_memslot() request. Wire it up with a FOLL_INTERRUPTIBLE flag that we've just introduced. This prepares KVM to be able to respond to SIGUSR1 (for QEMU that's the SIGIPI) even during e.g. handling an userfaultfd page fault. No functional change intended. Signed-off-by: Peter Xu <[email protected]> Reviewed-by: Sean Christopherson <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2022-11-09KVM: x86: smm: preserve interrupt shadow in SMRAMMaxim Levitsky2-6/+31
When #SMI is asserted, the CPU can be in interrupt shadow due to sti or mov ss. It is not mandatory in Intel/AMD prm to have the #SMI blocked during the shadow, and on top of that, since neither SVM nor VMX has true support for SMI window, waiting for one instruction would mean single stepping the guest. Instead, allow #SMI in this case, but both reset the interrupt window and stash its value in SMRAM to restore it on exit from SMM. This fixes rare failures seen mostly on windows guests on VMX, when #SMI falls on the sti instruction which mainfest in VM entry failure due to EFLAGS.IF not being set, but STI interrupt window still being set in the VMCS. Signed-off-by: Maxim Levitsky <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2022-11-09KVM: x86: SVM: don't save SVM state to SMRAM when VM is not long mode capableMaxim Levitsky1-0/+8
When the guest CPUID doesn't have support for long mode, 32 bit SMRAM layout is used and it has no support for preserving EFER and/or SVM state. Note that this isn't relevant to running 32 bit guests on VM which is long mode capable - such VM can still run 32 bit guests in compatibility mode. Signed-off-by: Maxim Levitsky <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2022-11-09KVM: x86: SVM: use smram structsMaxim Levitsky2-20/+7
Use SMM structs in the SVM code as well, which removes the last user of put_smstate/GET_SMSTATE so remove these macros as well. Signed-off-by: Maxim Levitsky <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2022-11-09KVM: svm: drop explicit return value of kvm_vcpu_mapMaxim Levitsky1-4/+3
if kvm_vcpu_map returns non zero value, error path should be triggered regardless of the exact returned error value. Suggested-by: Sean Christopherson <[email protected]> Signed-off-by: Maxim Levitsky <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2022-11-09KVM: x86: smm: use smram struct for 64 bit smram load/restoreMaxim Levitsky1-90/+63
Use kvm_smram_state_64 struct to save/restore the 64 bit SMM state (used when X86_FEATURE_LM is present in the guest CPUID, regardless of 32-bitness of the guest). Signed-off-by: Maxim Levitsky <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2022-11-09KVM: x86: smm: use smram struct for 32 bit smram load/restoreMaxim Levitsky1-94/+61
Use kvm_smram_state_32 struct to save/restore 32 bit SMM state (used when X86_FEATURE_LM is not present in the guest CPUID). Signed-off-by: Maxim Levitsky <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2022-11-09KVM: x86: smm: use smram structs in the common codeMaxim Levitsky4-19/+25
Use kvm_smram union instad of raw arrays in the common smm code. Signed-off-by: Maxim Levitsky <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>