aboutsummaryrefslogtreecommitdiff
path: root/arch/x86/include
AgeCommit message (Collapse)AuthorFilesLines
2024-04-30x86/irq: Remove bitfields in posted interrupt descriptorJacob Pan1-9/+12
Mixture of bitfields and types is weird and really not intuitive, remove bitfields and use typed data exclusively. Bitfields often result in inferior machine code. Suggested-by: Sean Christopherson <[email protected]> Suggested-by: Thomas Gleixner <[email protected]> Signed-off-by: Jacob Pan <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Link: https://lore.kernel.org/r/[email protected] Link: https://lore.kernel.org/all/20240404101735.402feec8@jacob-builder/T/#mf66e34a82a48f4d8e2926b5581eff59a122de53a
2024-04-30x86/irq: Unionize PID.PIR for 64bit access w/o castingJacob Pan1-1/+4
Make the PIR field into u64 such that atomic xchg64 can be used without ugly casting. Suggested-by: Thomas Gleixner <[email protected]> Signed-off-by: Jacob Pan <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2024-04-30KVM: VMX: Move posted interrupt descriptor out of VMX codeJacob Pan1-0/+88
To prepare native usage of posted interrupts, move the PID declarations out of VMX code such that they can be shared. Signed-off-by: Jacob Pan <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Acked-by: Sean Christopherson <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2024-04-29x86/sev: Add callback to apply RMP table fixups for kexecAshish Kalra1-0/+2
Handle cases where the RMP table placement in the BIOS is not 2M aligned and the kexec-ed kernel could try to allocate from within that chunk which then causes a fatal RMP fault. The kexec failure is illustrated below: SEV-SNP: RMP table physical range [0x0000007ffe800000 - 0x000000807f0fffff] BIOS-provided physical RAM map: BIOS-e820: [mem 0x0000000000000000-0x000000000008efff] usable BIOS-e820: [mem 0x000000000008f000-0x000000000008ffff] ACPI NVS ... BIOS-e820: [mem 0x0000004080000000-0x0000007ffe7fffff] usable BIOS-e820: [mem 0x0000007ffe800000-0x000000807f0fffff] reserved BIOS-e820: [mem 0x000000807f100000-0x000000807f1fefff] usable As seen here in the e820 memory map, the end range of the RMP table is not aligned to 2MB and not reserved but it is usable as RAM. Subsequently, kexec -s (KEXEC_FILE_LOAD syscall) loads it's purgatory code and boot_param, command line and other setup data into this RAM region as seen in the kexec logs below, which leads to fatal RMP fault during kexec boot. Loaded purgatory at 0x807f1fa000 Loaded boot_param, command line and misc at 0x807f1f8000 bufsz=0x1350 memsz=0x2000 Loaded 64bit kernel at 0x7ffae00000 bufsz=0xd06200 memsz=0x3894000 Loaded initrd at 0x7ff6c89000 bufsz=0x4176014 memsz=0x4176014 E820 memmap: 0000000000000000-000000000008efff (1) 000000000008f000-000000000008ffff (4) 0000000000090000-000000000009ffff (1) ... 0000004080000000-0000007ffe7fffff (1) 0000007ffe800000-000000807f0fffff (2) 000000807f100000-000000807f1fefff (1) 000000807f1ff000-000000807fffffff (2) nr_segments = 4 segment[0]: buf=0x00000000e626d1a2 bufsz=0x4000 mem=0x807f1fa000 memsz=0x5000 segment[1]: buf=0x0000000029c67bd6 bufsz=0x1350 mem=0x807f1f8000 memsz=0x2000 segment[2]: buf=0x0000000045c60183 bufsz=0xd06200 mem=0x7ffae00000 memsz=0x3894000 segment[3]: buf=0x000000006e54f08d bufsz=0x4176014 mem=0x7ff6c89000 memsz=0x4177000 kexec_file_load: type:0, start:0x807f1fa150 head:0x1184d0002 flags:0x0 Check if RMP table start and end physical range in the e820 tables are not aligned to 2MB and in that case map this range to reserved in all the three e820 tables. [ bp: Massage. ] Fixes: c3b86e61b756 ("x86/cpufeatures: Enable/unmask SEV-SNP CPU feature") Signed-off-by: Ashish Kalra <[email protected]> Signed-off-by: Borislav Petkov (AMD) <[email protected]> Link: https://lore.kernel.org/r/df6e995ff88565262c2c7c69964883ff8aa6fc30.1714090302.git.ashish.kalra@amd.com
2024-04-29x86/e820: Add a new e820 table update helperAshish Kalra1-0/+1
Add a new API helper e820__range_update_table() with which to update an arbitrary e820 table. Move all current users of e820__range_update_kexec() to this new helper. [ bp: Massage. ] Signed-off-by: Ashish Kalra <[email protected]> Signed-off-by: Borislav Petkov (AMD) <[email protected]> Link: https://lore.kernel.org/r/b726af213ad55053f8a7a1e793b01bb3f1ca9dd5.1714090302.git.ashish.kalra@amd.com
2024-04-25x86/mm: implement HAVE_ARCH_UNMAPPED_AREA_VMFLAGSRick Edgecombe1-0/+1
When memory is being placed, mmap() will take care to respect the guard gaps of certain types of memory (VM_SHADOWSTACK, VM_GROWSUP and VM_GROWSDOWN). In order to ensure guard gaps between mappings, mmap() needs to consider two things: 1. That the new mapping isn't placed in an any existing mappings guard gaps. 2. That the new mapping isn't placed such that any existing mappings are not in *its* guard gaps. The longstanding behavior of mmap() is to ensure 1, but not take any care around 2. So for example, if there is a PAGE_SIZE free area, and a mmap() with a PAGE_SIZE size, and a type that has a guard gap is being placed, mmap() may place the shadow stack in the PAGE_SIZE free area. Then the mapping that is supposed to have a guard gap will not have a gap to the adjacent VMA. Add x86 arch implementations of arch_get_unmapped_area_vmflags/_topdown() so future changes can allow the guard gap of type of vma being placed to be taken into account. This will be used for shadow stack memory. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Rick Edgecombe <[email protected]> Cc: Alexei Starovoitov <[email protected]> Cc: Andy Lutomirski <[email protected]> Cc: Aneesh Kumar K.V <[email protected]> Cc: Borislav Petkov (AMD) <[email protected]> Cc: Christophe Leroy <[email protected]> Cc: Dan Williams <[email protected]> Cc: Dave Hansen <[email protected]> Cc: Deepak Gupta <[email protected]> Cc: Guo Ren <[email protected]> Cc: Helge Deller <[email protected]> Cc: H. Peter Anvin (Intel) <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: "James E.J. Bottomley" <[email protected]> Cc: Kees Cook <[email protected]> Cc: Kirill A. Shutemov <[email protected]> Cc: Liam R. Howlett <[email protected]> Cc: Mark Brown <[email protected]> Cc: Michael Ellerman <[email protected]> Cc: Naveen N. Rao <[email protected]> Cc: Nicholas Piggin <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Thomas Gleixner <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-04-25mm/arch: provide pud_pfn() fallbackPeter Xu1-0/+1
The comment in the code explains the reasons. We took a different approach comparing to pmd_pfn() by providing a fallback function. Another option is to provide some lower level config options (compare to HUGETLB_PAGE or THP) to identify which layer an arch can support for such huge mappings. However that can be an overkill. [[email protected]: fix loongson defconfig] Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Peter Xu <[email protected]> Reviewed-by: Jason Gunthorpe <[email protected]> Tested-by: Ryan Roberts <[email protected]> Cc: Mike Rapoport (IBM) <[email protected]> Cc: Matthew Wilcox <[email protected]> Cc: Andrea Arcangeli <[email protected]> Cc: Andrew Jones <[email protected]> Cc: Aneesh Kumar K.V (IBM) <[email protected]> Cc: Axel Rasmussen <[email protected]> Cc: Christophe Leroy <[email protected]> Cc: Christoph Hellwig <[email protected]> Cc: David Hildenbrand <[email protected]> Cc: James Houghton <[email protected]> Cc: John Hubbard <[email protected]> Cc: Kirill A. Shutemov <[email protected]> Cc: Lorenzo Stoakes <[email protected]> Cc: Michael Ellerman <[email protected]> Cc: Muchun Song <[email protected]> Cc: Rik van Riel <[email protected]> Cc: Vlastimil Babka <[email protected]> Cc: Yang Shi <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-04-25x86: remove unneeded memblock_find_dma_reserve()Baoquan He1-1/+0
Patch series "mm/mm_init.c: refactor free_area_init_core()". In function free_area_init_core(), the code calculating zone->managed_pages and the subtracting dma_reserve from DMA zone looks very confusing. From git history, the code calculating zone->managed_pages was for zone->present_pages originally. The early rough assignment is for optimize zone's pcp and water mark setting. Later, managed_pages was introduced into zone to represent the number of managed pages by buddy. Now, zone->managed_pages is zeroed out and reset in mem_init() when calling memblock_free_all(). zone's pcp and wmark setting relying on actual zone->managed_pages are done later than mem_init() invocation. So we don't need rush to early calculate and set zone->managed_pages, just set it as zone->present_pages, will adjust it in mem_init(). And also add a new function calc_nr_kernel_pages() to count up free but not reserved pages in memblock, then assign it to nr_all_pages and nr_kernel_pages after memmap pages are allocated. This patch (of 6): Variable dma_reserve and its usage was introduced in commit 0e0b864e069c ("[PATCH] Account for memmap and optionally the kernel image as holes"). Its original purpose was to accounting for the reserved pages in DMA zone to make DMA zone's watermarks calculation more accurate on x86. However, currently there's zone->managed_pages to account for all available pages for buddy, zone->present_pages to account for all present physical pages in zone. What is more important, on x86, calculating and setting the zone->managed_pages is a temporary move, all zone's managed_pages will be zeroed out and reset to the actual value according to how many pages are added to buddy allocator in mem_init(). Before mem_init(), no buddy alloction is requested. And zone's pcp and watermark setting are all done after mem_init(). So, no need to worry about the DMA zone's setting accuracy during free_area_init(). Hence, remove memblock_find_dma_reserve() to stop calculating and setting dma_reserve. Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Baoquan He <[email protected]> Reviewed-by: Mike Rapoport (IBM) <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-04-25fix missing vmalloc.h includesKent Overstreet1-0/+1
Patch series "Memory allocation profiling", v6. Overview: Low overhead [1] per-callsite memory allocation profiling. Not just for debug kernels, overhead low enough to be deployed in production. Example output: root@moria-kvm:~# sort -rn /proc/allocinfo 127664128 31168 mm/page_ext.c:270 func:alloc_page_ext 56373248 4737 mm/slub.c:2259 func:alloc_slab_page 14880768 3633 mm/readahead.c:247 func:page_cache_ra_unbounded 14417920 3520 mm/mm_init.c:2530 func:alloc_large_system_hash 13377536 234 block/blk-mq.c:3421 func:blk_mq_alloc_rqs 11718656 2861 mm/filemap.c:1919 func:__filemap_get_folio 9192960 2800 kernel/fork.c:307 func:alloc_thread_stack_node 4206592 4 net/netfilter/nf_conntrack_core.c:2567 func:nf_ct_alloc_hashtable 4136960 1010 drivers/staging/ctagmod/ctagmod.c:20 [ctagmod] func:ctagmod_start 3940352 962 mm/memory.c:4214 func:alloc_anon_folio 2894464 22613 fs/kernfs/dir.c:615 func:__kernfs_new_node ... Usage: kconfig options: - CONFIG_MEM_ALLOC_PROFILING - CONFIG_MEM_ALLOC_PROFILING_ENABLED_BY_DEFAULT - CONFIG_MEM_ALLOC_PROFILING_DEBUG adds warnings for allocations that weren't accounted because of a missing annotation sysctl: /proc/sys/vm/mem_profiling Runtime info: /proc/allocinfo Notes: [1]: Overhead To measure the overhead we are comparing the following configurations: (1) Baseline with CONFIG_MEMCG_KMEM=n (2) Disabled by default (CONFIG_MEM_ALLOC_PROFILING=y && CONFIG_MEM_ALLOC_PROFILING_BY_DEFAULT=n) (3) Enabled by default (CONFIG_MEM_ALLOC_PROFILING=y && CONFIG_MEM_ALLOC_PROFILING_BY_DEFAULT=y) (4) Enabled at runtime (CONFIG_MEM_ALLOC_PROFILING=y && CONFIG_MEM_ALLOC_PROFILING_BY_DEFAULT=n && /proc/sys/vm/mem_profiling=1) (5) Baseline with CONFIG_MEMCG_KMEM=y && allocating with __GFP_ACCOUNT (6) Disabled by default (CONFIG_MEM_ALLOC_PROFILING=y && CONFIG_MEM_ALLOC_PROFILING_BY_DEFAULT=n) && CONFIG_MEMCG_KMEM=y (7) Enabled by default (CONFIG_MEM_ALLOC_PROFILING=y && CONFIG_MEM_ALLOC_PROFILING_BY_DEFAULT=y) && CONFIG_MEMCG_KMEM=y Performance overhead: To evaluate performance we implemented an in-kernel test executing multiple get_free_page/free_page and kmalloc/kfree calls with allocation sizes growing from 8 to 240 bytes with CPU frequency set to max and CPU affinity set to a specific CPU to minimize the noise. Below are results from running the test on Ubuntu 22.04.2 LTS with 6.8.0-rc1 kernel on 56 core Intel Xeon: kmalloc pgalloc (1 baseline) 6.764s 16.902s (2 default disabled) 6.793s (+0.43%) 17.007s (+0.62%) (3 default enabled) 7.197s (+6.40%) 23.666s (+40.02%) (4 runtime enabled) 7.405s (+9.48%) 23.901s (+41.41%) (5 memcg) 13.388s (+97.94%) 48.460s (+186.71%) (6 def disabled+memcg) 13.332s (+97.10%) 48.105s (+184.61%) (7 def enabled+memcg) 13.446s (+98.78%) 54.963s (+225.18%) Memory overhead: Kernel size: text data bss dec diff (1) 26515311 18890222 17018880 62424413 (2) 26524728 19423818 16740352 62688898 264485 (3) 26524724 19423818 16740352 62688894 264481 (4) 26524728 19423818 16740352 62688898 264485 (5) 26541782 18964374 16957440 62463596 39183 Memory consumption on a 56 core Intel CPU with 125GB of memory: Code tags: 192 kB PageExts: 262144 kB (256MB) SlabExts: 9876 kB (9.6MB) PcpuExts: 512 kB (0.5MB) Total overhead is 0.2% of total memory. Benchmarks: Hackbench tests run 100 times: hackbench -s 512 -l 200 -g 15 -f 25 -P baseline disabled profiling enabled profiling avg 0.3543 0.3559 (+0.0016) 0.3566 (+0.0023) stdev 0.0137 0.0188 0.0077 hackbench -l 10000 baseline disabled profiling enabled profiling avg 6.4218 6.4306 (+0.0088) 6.5077 (+0.0859) stdev 0.0933 0.0286 0.0489 stress-ng tests: stress-ng --class memory --seq 4 -t 60 stress-ng --class cpu --seq 4 -t 60 Results posted at: https://evilpiepirate.org/~kent/memalloc_prof_v4_stress-ng/ [2] https://lore.kernel.org/all/[email protected]/ This patch (of 37): The next patch drops vmalloc.h from a system header in order to fix a circular dependency; this adds it to all the files that were pulling it in implicitly. [[email protected]: fix arch/alpha/lib/memcpy.c] Link: https://lkml.kernel.org/r/[email protected] [[email protected]: fix arch/x86/mm/numa_32.c] Link: https://lkml.kernel.org/r/[email protected] [[email protected]: a few places were depending on sizes.h] Link: https://lkml.kernel.org/r/[email protected] [[email protected]: fix mm/kasan/hw_tags.c] Link: https://lkml.kernel.org/r/[email protected] [[email protected]: fix arc build] Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Kent Overstreet <[email protected]> Signed-off-by: Suren Baghdasaryan <[email protected]> Signed-off-by: Arnd Bergmann <[email protected]> Reviewed-by: Pasha Tatashin <[email protected]> Tested-by: Kees Cook <[email protected]> Cc: Alexander Viro <[email protected]> Cc: Alex Gaynor <[email protected]> Cc: Alice Ryhl <[email protected]> Cc: Andreas Hindborg <[email protected]> Cc: Benno Lossin <[email protected]> Cc: "Björn Roy Baron" <[email protected]> Cc: Boqun Feng <[email protected]> Cc: Christoph Lameter <[email protected]> Cc: Dennis Zhou <[email protected]> Cc: Gary Guo <[email protected]> Cc: Miguel Ojeda <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Tejun Heo <[email protected]> Cc: Vlastimil Babka <[email protected]> Cc: Wedson Almeida Filho <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-04-25x86/sev: Shorten struct name snp_secrets_page_layout to snp_secrets_pageTom Lendacky1-1/+1
Ending a struct name with "layout" is a little redundant, so shorten the snp_secrets_page_layout name to just snp_secrets_page. No functional change. [ bp: Rename the local pointer to "secrets" too for more clarity. ] Signed-off-by: Tom Lendacky <[email protected]> Signed-off-by: Borislav Petkov (AMD) <[email protected]> Link: https://lore.kernel.org/r/bc8d58302c6ab66c3beeab50cce3ec2c6bd72d6c.1713974291.git.thomas.lendacky@amd.com
2024-04-24x86/tdx: Preserve shared bit on mprotect()Kirill A. Shutemov2-1/+3
The TDX guest platform takes one bit from the physical address to indicate if the page is shared (accessible by VMM). This bit is not part of the physical_mask and is not preserved during mprotect(). As a result, the 'shared' bit is lost during mprotect() on shared mappings. _COMMON_PAGE_CHG_MASK specifies which PTE bits need to be preserved during modification. AMD includes 'sme_me_mask' in the define to preserve the 'encrypt' bit. To cover both Intel and AMD cases, include 'cc_mask' in _COMMON_PAGE_CHG_MASK instead of 'sme_me_mask'. Reported-and-tested-by: Chris Oo <[email protected]> Fixes: 41394e33f3a0 ("x86/tdx: Extend the confidential computing API to support TDX guests") Signed-off-by: Kirill A. Shutemov <[email protected]> Signed-off-by: Dave Hansen <[email protected]> Reviewed-by: Rick Edgecombe <[email protected]> Reviewed-by: Kuppuswamy Sathyanarayanan <[email protected]> Reviewed-by: Tom Lendacky <[email protected]> Cc: [email protected] Link: https://lore.kernel.org/all/20240424082035.4092071-1-kirill.shutemov%40linux.intel.com
2024-04-24locking/pvqspinlock/x86: Use _Q_LOCKED_VAL in PV_UNLOCK_ASM macroUros Bizjak1-1/+1
Use _Q_LOCKED_VAL instead of hardcoded $0x1 in PV_UNLOCK_ASM macro. No functional changes intended. Signed-off-by: Uros Bizjak <[email protected]> Signed-off-by: Ingo Molnar <[email protected]> Acked-by: Boqun Feng <[email protected]> Cc: Waiman Long <[email protected]> Cc: Linus Torvalds <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2024-04-24locking/qspinlock/x86: Micro-optimize virt_spin_lock()Uros Bizjak1-4/+9
Optimize virt_spin_lock() to use simpler and faster: atomic_try_cmpxchg(*ptr, &val, new) instead of: atomic_cmpxchg(*ptr, val, new) == val The x86 CMPXCHG instruction returns success in the ZF flag, so this change saves a compare after the CMPXCHG. Also optimize retry loop a bit. atomic_try_cmpxchg() fails iff &lock->val != 0, so there is no need to load and compare the lock value again - cpu_relax() can be unconditinally called in this case. This allows us to generate optimized: 1f: ba 01 00 00 00 mov $0x1,%edx 24: 8b 03 mov (%rbx),%eax 26: 85 c0 test %eax,%eax 28: 75 63 jne 8d <...> 2a: f0 0f b1 13 lock cmpxchg %edx,(%rbx) 2e: 75 5d jne 8d <...> ... 8d: f3 90 pause 8f: eb 93 jmp 24 <...> instead of: 1f: ba 01 00 00 00 mov $0x1,%edx 24: 8b 03 mov (%rbx),%eax 26: 85 c0 test %eax,%eax 28: 75 13 jne 3d <...> 2a: f0 0f b1 13 lock cmpxchg %edx,(%rbx) 2e: 85 c0 test %eax,%eax 30: 75 f2 jne 24 <...> ... 3d: f3 90 pause 3f: eb e3 jmp 24 <...> Signed-off-by: Uros Bizjak <[email protected]> Signed-off-by: Ingo Molnar <[email protected]> Cc: Waiman Long <[email protected]> Cc: Linus Torvalds <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2024-04-24locking/atomic/x86: Merge __arch{,_try}_cmpxchg64_emu_local() with ↵Uros Bizjak1-46/+10
__arch{,_try}_cmpxchg64_emu() Macros __arch{,_try}_cmpxchg64_emu() are almost identical to their local variants __arch{,_try}_cmpxchg64_emu_local(), differing only by lock prefixes. Merge these two macros by introducing additional macro parameters to pass lock location and lock prefix from their respective static inline functions. No functional change intended. Signed-off-by: Uros Bizjak <[email protected]> Signed-off-by: Ingo Molnar <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: "H. Peter Anvin" <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2024-04-23crash: add a new kexec flag for hotplug supportSourabh Jain1-9/+2
Commit a72bbec70da2 ("crash: hotplug support for kexec_load()") introduced a new kexec flag, `KEXEC_UPDATE_ELFCOREHDR`. Kexec tool uses this flag to indicate to the kernel that it is safe to modify the elfcorehdr of the kdump image loaded using the kexec_load system call. However, it is possible that architectures may need to update kexec segments other then elfcorehdr. For example, FDT (Flatten Device Tree) on PowerPC. Introducing a new kexec flag for every new kexec segment may not be a good solution. Hence, a generic kexec flag bit, `KEXEC_CRASH_HOTPLUG_SUPPORT`, is introduced to share the CPU/Memory hotplug support intent between the kexec tool and the kernel for the kexec_load system call. Now we have two kexec flags that enables crash hotplug support for kexec_load system call. First is KEXEC_UPDATE_ELFCOREHDR (only used in x86), and second is KEXEC_CRASH_HOTPLUG_SUPPORT (for all architectures). To simplify the process of finding and reporting the crash hotplug support the following changes are introduced. 1. Define arch specific function to process the kexec flags and determine crash hotplug support 2. Rename the @update_elfcorehdr member of struct kimage to @hotplug_support and populate it for both kexec_load and kexec_file_load syscalls, because architecture can update more than one kexec segment 3. Let generic function crash_check_hotplug_support report hotplug support for loaded kdump image based on value of @hotplug_support To bring the x86 crash hotplug support in line with the above points, the following changes have been made: - Introduce the arch_crash_hotplug_support function to process kexec flags and determine crash hotplug support - Remove the arch_crash_hotplug_[cpu|memory]_support functions Signed-off-by: Sourabh Jain <[email protected]> Acked-by: Baoquan He <[email protected]> Acked-by: Hari Bathini <[email protected]> Signed-off-by: Michael Ellerman <[email protected]> Link: https://msgid.link/[email protected]
2024-04-23crash: forward memory_notify arg to arch crash hotplug handlerSourabh Jain1-1/+1
In the event of memory hotplug or online/offline events, the crash memory hotplug notifier `crash_memhp_notifier()` receives a `memory_notify` object but doesn't forward that object to the generic and architecture-specific crash hotplug handler. The `memory_notify` object contains the starting PFN (Page Frame Number) and the number of pages in the hot-removed memory. This information is necessary for architectures like PowerPC to update/recreate the kdump image, specifically `elfcorehdr`. So update the function signature of `crash_handle_hotplug_event()` and `arch_crash_handle_hotplug_event()` to accept the `memory_notify` object as an argument from crash memory hotplug notifier. Since no such object is available in the case of CPU hotplug event, the crash CPU hotplug notifier `crash_cpuhp_online()` passes NULL to the crash hotplug handler. Signed-off-by: Sourabh Jain <[email protected]> Acked-by: Baoquan He <[email protected]> Acked-by: Hari Bathini <[email protected]> Signed-off-by: Michael Ellerman <[email protected]> Link: https://msgid.link/[email protected]
2024-04-22x86: Stop using weak symbols for __iowrite32_copy()Jason Gunthorpe1-0/+17
Start switching iomap_copy routines over to use #define and arch provided inline/macro functions instead of weak symbols. Inline functions allow more compiler optimization and this is often a driver hot path. x86 has the only weak implementation for __iowrite32_copy(), so replace it with a static inline containing the same single instruction inline assembly. The compiler will generate the "mov edx,ecx" in a more optimal way. Remove iomap_copy_64.S Link: https://lore.kernel.org/r/[email protected] Acked-by: Arnd Bergmann <[email protected]> Signed-off-by: Jason Gunthorpe <[email protected]>
2024-04-22x86/cpu/vfm: Update arch/x86/include/asm/intel-family.hTony Luck1-0/+84
New CPU #defines encode vendor and family as well as model. Update the example usage comment in arch/x86/kernel/cpu/match.c Signed-off-by: Tony Luck <[email protected]> Signed-off-by: Borislav Petkov (AMD) <[email protected]> Reviewed-by: Thomas Gleixner <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2024-04-22x86/cpu/vfm: Add new macros to work with (vendor/family/model) valuesTony Luck1-0/+93
To avoid adding a slew of new macros for each new Intel CPU family switch over from providing CPU model number #defines to a new scheme that encodes vendor, family, and model in a single number. [ bp: s/casted/cast/g ] Signed-off-by: Tony Luck <[email protected]> Signed-off-by: Borislav Petkov (AMD) <[email protected]> Reviewed-by: Thomas Gleixner <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2024-04-22x86/cpu/vfm: Add/initialize x86_vfm field to struct cpuinfo_x86Tony Luck1-3/+17
Refactor struct cpuinfo_x86 so that the vendor, family, and model fields are overlaid in a union with a 32-bit field that combines all three (together with a one byte reserved field in the upper byte). This will make it easy, cheap, and reliable to check all three values at once. See https://lore.kernel.org/r/Zgr6kT8oULbnmEXx@agluck-desk3 for why the ordering is (low-to-high bits): (vendor, family, model) [ bp: Move comments over the line, add the backstory about the particular order of the fields. ] Signed-off-by: Tony Luck <[email protected]> Signed-off-by: Borislav Petkov (AMD) <[email protected]> Reviewed-by: Thomas Gleixner <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2024-04-21Merge tag 'sched_urgent_for_v6.9_rc5' of ↵Linus Torvalds1-0/+3
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull scheduler fix from Borislav Petkov: - Add a missing memory barrier in the concurrency ID mm switching * tag 'sched_urgent_for_v6.9_rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: sched: Add missing memory barrier in switch_mm_cid
2024-04-20Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvmLinus Torvalds2-0/+2
Pull kvm fixes from Paolo Bonzini: "This is a bit on the large side, mostly due to two changes: - Changes to disable some broken PMU virtualization (see below for details under "x86 PMU") - Clean up SVM's enter/exit assembly code so that it can be compiled without OBJECT_FILES_NON_STANDARD. This fixes a warning "Unpatched return thunk in use. This should not happen!" when running KVM selftests. Everything else is small bugfixes and selftest changes: - Fix a mostly benign bug in the gfn_to_pfn_cache infrastructure where KVM would allow userspace to refresh the cache with a bogus GPA. The bug has existed for quite some time, but was exposed by a new sanity check added in 6.9 (to ensure a cache is either GPA-based or HVA-based). - Drop an unused param from gfn_to_pfn_cache_invalidate_start() that got left behind during a 6.9 cleanup. - Fix a math goof in x86's hugepage logic for KVM_SET_MEMORY_ATTRIBUTES that results in an array overflow (detected by KASAN). - Fix a bug where KVM incorrectly clears root_role.direct when userspace sets guest CPUID. - Fix a dirty logging bug in the where KVM fails to write-protect SPTEs used by a nested guest, if KVM is using Page-Modification Logging and the nested hypervisor is NOT using EPT. x86 PMU: - Drop support for virtualizing adaptive PEBS, as KVM's implementation is architecturally broken without an obvious/easy path forward, and because exposing adaptive PEBS can leak host LBRs to the guest, i.e. can leak host kernel addresses to the guest. - Set the enable bits for general purpose counters in PERF_GLOBAL_CTRL at RESET time, as done by both Intel and AMD processors. - Disable LBR virtualization on CPUs that don't support LBR callstacks, as KVM unconditionally uses PERF_SAMPLE_BRANCH_CALL_STACK when creating the perf event, and would fail on such CPUs. Tests: - Fix a flaw in the max_guest_memory selftest that results in it exhausting the supply of ucall structures when run with more than 256 vCPUs. - Mark KVM_MEM_READONLY as supported for RISC-V in set_memory_region_test" * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (30 commits) KVM: Drop unused @may_block param from gfn_to_pfn_cache_invalidate_start() KVM: selftests: Add coverage of EPT-disabled to vmx_dirty_log_test KVM: x86/mmu: Fix and clarify comments about clearing D-bit vs. write-protecting KVM: x86/mmu: Remove function comments above clear_dirty_{gfn_range,pt_masked}() KVM: x86/mmu: Write-protect L2 SPTEs in TDP MMU when clearing dirty status KVM: x86/mmu: Precisely invalidate MMU root_role during CPUID update KVM: VMX: Disable LBR virtualization if the CPU doesn't support LBR callstacks perf/x86/intel: Expose existence of callback support to KVM KVM: VMX: Snapshot LBR capabilities during module initialization KVM: x86/pmu: Do not mask LVTPC when handling a PMI on AMD platforms KVM: x86: Snapshot if a vCPU's vendor model is AMD vs. Intel compatible KVM: x86: Stop compiling vmenter.S with OBJECT_FILES_NON_STANDARD KVM: SVM: Create a stack frame in __svm_sev_es_vcpu_run() KVM: SVM: Save/restore args across SEV-ES VMRUN via host save area KVM: SVM: Save/restore non-volatile GPRs in SEV-ES VMRUN via host save area KVM: SVM: Clobber RAX instead of RBX when discarding spec_ctrl_intercepted KVM: SVM: Drop 32-bit "support" from __svm_sev_es_vcpu_run() KVM: SVM: Wrap __svm_sev_es_vcpu_run() with #ifdef CONFIG_KVM_AMD_SEV KVM: SVM: Create a stack frame in __svm_vcpu_run() for unwinding KVM: SVM: Remove a useless zeroing of allocated memory ...
2024-04-19KVM, x86: add architectural support code for #VEPaolo Bonzini1-0/+12
Dump the contents of the #VE info data structure and assert that #VE does not happen, but do not yet do anything with it. No functional change intended, separated for clarity only. Extracted from a patch by Isaku Yamahata <[email protected]>. Signed-off-by: Paolo Bonzini <[email protected]>
2024-04-19KVM: x86/mmu: Track shadow MMIO value on a per-VM basisSean Christopherson1-0/+2
TDX will use a different shadow PTE entry value for MMIO from VMX. Add a member to kvm_arch and track value for MMIO per-VM instead of a global variable. By using the per-VM EPT entry value for MMIO, the existing VMX logic is kept working. Introduce a separate setter function so that guest TD can use a different value later. Signed-off-by: Sean Christopherson <[email protected]> Signed-off-by: Isaku Yamahata <[email protected]> Message-Id: <229a18434e5d83f45b1fcd7bf1544d79db1becb6.1705965635.git.isaku.yamahata@intel.com> Reviewed-by: Xiaoyao Li <[email protected]> Reviewed-by: Binbin Wu <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2024-04-19KVM: x86/mmu: Add Suppress VE bit to EPT shadow_mmio_mask/shadow_present_maskIsaku Yamahata1-0/+1
To make use of the same value of shadow_mmio_mask and shadow_present_mask for TDX and VMX, add Suppress-VE bit to shadow_mmio_mask and shadow_present_mask so that they can be common for both VMX and TDX. TDX will require shadow_mmio_mask and shadow_present_mask to include VMX_SUPPRESS_VE for shared GPA so that EPT violation is triggered for shared GPA. For VMX, VMX_SUPPRESS_VE doesn't matter for MMIO because the spte value is defined so as to cause EPT misconfig. Signed-off-by: Isaku Yamahata <[email protected]> Message-Id: <97cc616b3563cd8277be91aaeb3e14bce23c3649.1705965635.git.isaku.yamahata@intel.com> Reviewed-by: Xiaoyao Li <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2024-04-19Merge x86 bugfixes from Linux 6.9-rc3Paolo Bonzini11-17/+48
Pull fix for SEV-SNP late disable bugs. Signed-off-by: Paolo Bonzini <[email protected]>
2024-04-16Merge tag 'kvm-x86-fixes-6.9-rcN' of https://github.com/kvm-x86/linux into HEADPaolo Bonzini1-0/+1
- Fix a mostly benign bug in the gfn_to_pfn_cache infrastructure where KVM would allow userspace to refresh the cache with a bogus GPA. The bug has existed for quite some time, but was exposed by a new sanity check added in 6.9 (to ensure a cache is either GPA-based or HVA-based). - Drop an unused param from gfn_to_pfn_cache_invalidate_start() that got left behind during a 6.9 cleanup. - Disable support for virtualizing adaptive PEBS, as KVM's implementation is architecturally broken and can leak host LBRs to the guest. - Fix a bug where KVM neglects to set the enable bits for general purpose counters in PERF_GLOBAL_CTRL when initializing the virtual PMU. Both Intel and AMD architectures require the bits to be set at RESET in order for v2 PMUs to be backwards compatible with software that was written for v1 PMUs, i.e. for software that will never manually set the global enables. - Disable LBR virtualization on CPUs that don't support LBR callstacks, as KVM unconditionally uses PERF_SAMPLE_BRANCH_CALL_STACK when creating the virtual LBR perf event, i.e. KVM will always fail to create LBR events on such CPUs. - Fix a math goof in x86's hugepage logic for KVM_SET_MEMORY_ATTRIBUTES that results in an array overflow (detected by KASAN). - Fix a flaw in the max_guest_memory selftest that results in it exhausting the supply of ucall structures when run with more than 256 vCPUs. - Mark KVM_MEM_READONLY as supported for RISC-V in set_memory_region_test. - Fix a bug where KVM incorrectly thinks a TDP MMU root is an indirect shadow root due KVM unnecessarily clobbering root_role.direct when userspace sets guest CPUID. - Fix a dirty logging bug in the where KVM fails to write-protect TDP MMU SPTEs used for L2 if Page-Modification Logging is enabled for L1 and the L1 hypervisor is NOT using EPT (if nEPT is enabled, KVM doesn't use the TDP MMU to run L2). For simplicity, KVM always disables PML when running L2, but the TDP MMU wasn't accounting for root-specific conditions that force write- protect based dirty logging.
2024-04-16sched: Add missing memory barrier in switch_mm_cidMathieu Desnoyers1-0/+3
Many architectures' switch_mm() (e.g. arm64) do not have an smp_mb() which the core scheduler code has depended upon since commit: commit 223baf9d17f25 ("sched: Fix performance regression introduced by mm_cid") If switch_mm() doesn't call smp_mb(), sched_mm_cid_remote_clear() can unset the actively used cid when it fails to observe active task after it sets lazy_put. There *is* a memory barrier between storing to rq->curr and _return to userspace_ (as required by membarrier), but the rseq mm_cid has stricter requirements: the barrier needs to be issued between store to rq->curr and switch_mm_cid(), which happens earlier than: - spin_unlock(), - switch_to(). So it's fine when the architecture switch_mm() happens to have that barrier already, but less so when the architecture only provides the full barrier in switch_to() or spin_unlock(). It is a bug in the rseq switch_mm_cid() implementation. All architectures that don't have memory barriers in switch_mm(), but rather have the full barrier either in finish_lock_switch() or switch_to() have them too late for the needs of switch_mm_cid(). Introduce a new smp_mb__after_switch_mm(), defined as smp_mb() in the generic barrier.h header, and use it in switch_mm_cid() for scheduler transitions where switch_mm() is expected to provide a memory barrier. Architectures can override smp_mb__after_switch_mm() if their switch_mm() implementation provides an implicit memory barrier. Override it with a no-op on x86 which implicitly provide this memory barrier by writing to CR3. Fixes: 223baf9d17f2 ("sched: Fix performance regression introduced by mm_cid") Reported-by: levi.yun <[email protected]> Signed-off-by: Mathieu Desnoyers <[email protected]> Signed-off-by: Ingo Molnar <[email protected]> Reviewed-by: Catalin Marinas <[email protected]> # for arm64 Acked-by: Dave Hansen <[email protected]> # for x86 Cc: <[email protected]> # 6.4.x Cc: Linus Torvalds <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2024-04-14locking/atomic/x86: Introduce arch_try_cmpxchg64_local()Uros Bizjak2-0/+40
Introduce arch_try_cmpxchg64_local() for 64-bit and 32-bit targets to improve code using cmpxchg64_local(). On 64-bit targets, the generated assembly improves from: 3e28: 31 c0 xor %eax,%eax 3e2a: 4d 0f b1 7d 00 cmpxchg %r15,0x0(%r13) 3e2f: 48 85 c0 test %rax,%rax 3e32: 0f 85 9f 00 00 00 jne 3ed7 <...> to: 3e28: 31 c0 xor %eax,%eax 3e2a: 4d 0f b1 7d 00 cmpxchg %r15,0x0(%r13) 3e2f: 0f 85 9f 00 00 00 jne 3ed4 <...> where a TEST instruction after CMPXCHG is saved. The improvements for 32-bit targets are even more noticeable, because double-word compare after CMPXCHG8B gets eliminated. Signed-off-by: Uros Bizjak <[email protected]> Signed-off-by: Ingo Molnar <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Waiman Long <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2024-04-14x86/pat: Introduce lookup_address_in_pgd_attr()Juergen Gross1-0/+2
Add lookup_address_in_pgd_attr() doing the same as the already existing lookup_address_in_pgd(), but returning the effective settings of the NX and RW bits of all walked page table levels, too. This will be needed in order to match hardware behavior when looking for effective access rights, especially for detecting writable code pages. In order to avoid code duplication, let lookup_address_in_pgd() call lookup_address_in_pgd_attr() with dummy parameters. Signed-off-by: Juergen Gross <[email protected]> Signed-off-by: Ingo Molnar <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2024-04-12Merge branch 'x86/urgent' into x86/cpu, to resolve conflictIngo Molnar5-9/+38
There's a new conflict between this commit pending in x86/cpu: 63edbaa48a57 x86/cpu/topology: Add support for the AMD 0x80000026 leaf And these fixes in x86/urgent: c064b536a8f9 x86/cpu/amd: Make the NODEID_MSR union actually work 1b3108f6898e x86/cpu/amd: Make the CPUID 0x80000008 parser correct Resolve them. Conflicts: arch/x86/kernel/cpu/topology_amd.c Signed-off-by: Ingo Molnar <[email protected]>
2024-04-12locking/pvqspinlock/x86: Remove redundant CMP after CMPXCHG in ↵Uros Bizjak1-3/+2
__raw_callee_save___pv_queued_spin_unlock() x86 CMPXCHG instruction returns success in the ZF flag. Remove redundant CMP instruction after CMPXCHG that performs the same check. Also update the function comment to mention the modern version of the equivalent C code. Signed-off-by: Uros Bizjak <[email protected]> Signed-off-by: Ingo Molnar <[email protected]> Cc: Linus Torvalds <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2024-04-12KVM: x86: Split core of hypercall emulation to helper functionSean Christopherson1-0/+4
By necessity, TDX will use a different register ABI for hypercalls. Break out the core functionality so that it may be reused for TDX. Signed-off-by: Sean Christopherson <[email protected]> Signed-off-by: Isaku Yamahata <[email protected]> Message-Id: <5134caa55ac3dec33fb2addb5545b52b3b52db02.1705965635.git.isaku.yamahata@intel.com> Signed-off-by: Paolo Bonzini <[email protected]>
2024-04-11perf/x86/intel: Expose existence of callback support to KVMSean Christopherson1-0/+1
Add a "has_callstack" field to the x86_pmu_lbr structure used to pass information to KVM, and set it accordingly in x86_perf_get_lbr(). KVM will use has_callstack to avoid trying to create perf LBR events with PERF_SAMPLE_BRANCH_CALL_STACK on CPUs that don't support callstacks. Reviewed-by: Mingwei Zhang <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Sean Christopherson <[email protected]>
2024-04-11KVM: SEV: introduce KVM_SEV_INIT2 operationPaolo Bonzini1-0/+9
The idea that no parameter would ever be necessary when enabling SEV or SEV-ES for a VM was decidedly optimistic. In fact, in some sense it's already a parameter whether SEV or SEV-ES is desired. Another possible source of variability is the desired set of VMSA features, as that affects the measurement of the VM's initial state and cannot be changed arbitrarily by the hypervisor. Create a new sub-operation for KVM_MEMORY_ENCRYPT_OP that can take a struct, and put the new op to work by including the VMSA features as a field of the struct. The existing KVM_SEV_INIT and KVM_SEV_ES_INIT use the full set of supported VMSA features for backwards compatibility. The struct also includes the usual bells and whistles for future extensibility: a flags field that must be zero for now, and some padding at the end. Signed-off-by: Paolo Bonzini <[email protected]> Message-ID: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2024-04-11KVM: SEV: sync FPU and AVX state at LAUNCH_UPDATE_VMSA timePaolo Bonzini1-0/+3
SEV-ES allows passing custom contents for x87, SSE and AVX state into the VMSA. Allow userspace to do that with the usual KVM_SET_XSAVE API and only mark FPU contents as confidential after it has been copied and encrypted into the VMSA. Since the XSAVE state for AVX is the first, it does not need the compacted-state handling of get_xsave_addr(). However, there are other parts of XSAVE state in the VMSA that currently are not handled, and the validation logic of get_xsave_addr() is pointless to duplicate in KVM, so move get_xsave_addr() to public FPU API; it is really just a facility to operate on XSAVE state and does not expose any internal details of arch/x86/kernel/fpu. Acked-by: Dave Hansen <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]> Message-ID: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2024-04-11KVM: SEV: define VM types for SEV and SEV-ESPaolo Bonzini1-0/+2
Signed-off-by: Paolo Bonzini <[email protected]> Message-ID: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2024-04-11KVM: x86: add fields to struct kvm_arch for CoCo featuresPaolo Bonzini1-2/+5
Some VM types have characteristics in common; in fact, the only use of VM types right now is kvm_arch_has_private_mem and it assumes that _all_ nonzero VM types have private memory. We will soon introduce a VM type for SEV and SEV-ES VMs, and at that point we will have two special characteristics of confidential VMs that depend on the VM type: not just if memory is private, but also whether guest state is protected. For the latter we have kvm->arch.guest_state_protected, which is only set on a fully initialized VM. For VM types with protected guest state, we can actually fix a problem in the SEV-ES implementation, where ioctls to set registers do not cause an error even if the VM has been initialized and the guest state encrypted. Make sure that when using VM types that will become an error. Signed-off-by: Paolo Bonzini <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]> Reviewed-by: Isaku Yamahata <[email protected]> Message-ID: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2024-04-11KVM: SEV: publish supported VMSA featuresPaolo Bonzini1-2/+7
Compute the set of features to be stored in the VMSA when KVM is initialized; move it from there into kvm_sev_info when SEV is initialized, and then into the initial VMSA. The new variable can then be used to return the set of supported features to userspace, via the KVM_GET_DEVICE_ATTR ioctl. Signed-off-by: Paolo Bonzini <[email protected]> Reviewed-by: Isaku Yamahata <[email protected]> Message-ID: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2024-04-11KVM: introduce new vendor op for KVM_GET_DEVICE_ATTRPaolo Bonzini2-0/+2
Allow vendor modules to provide their own attributes on /dev/kvm. To avoid proliferation of vendor ops, implement KVM_HAS_DEVICE_ATTR and KVM_GET_DEVICE_ATTR in terms of the same function. You're not supposed to use KVM_GET_DEVICE_ATTR to do complicated computations, especially on /dev/kvm. Reviewed-by: Michael Roth <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]> Reviewed-by: Isaku Yamahata <[email protected]> Message-ID: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2024-04-11KVM: x86: Snapshot if a vCPU's vendor model is AMD vs. Intel compatibleSean Christopherson1-0/+1
Add kvm_vcpu_arch.is_amd_compatible to cache if a vCPU's vendor model is compatible with AMD, i.e. if the vCPU vendor is AMD or Hygon, along with helpers to check if a vCPU is compatible AMD vs. Intel. To handle Intel vs. AMD behavior related to masking the LVTPC entry, KVM will need to check for vendor compatibility on every PMI injection, i.e. querying for AMD will soon be a moderately hot path. Note! This subtly (or maybe not-so-subtly) makes "Intel compatible" KVM's default behavior, both if userspace omits (or never sets) CPUID 0x0 and if userspace sets a completely unknown vendor. One could argue that KVM should treat such vCPUs as not being compatible with Intel *or* AMD, but that would add useless complexity to KVM. KVM needs to do *something* in the face of vendor specific behavior, and so unless KVM conjured up a magic third option, choosing to treat unknown vendors as neither Intel nor AMD means that checks on AMD compatibility would yield Intel behavior, and checks for Intel compatibility would yield AMD behavior. And that's far worse as it would effectively yield random behavior depending on whether KVM checked for AMD vs. Intel vs. !AMD vs. !Intel. And practically speaking, all x86 CPUs follow either Intel or AMD architecture, i.e. "supporting" an unknown third architecture adds no value. Deliberately don't convert any of the existing guest_cpuid_is_intel() checks, as the Intel side of things is messier due to some flows explicitly checking for exactly vendor==Intel, versus some flows assuming anything that isn't "AMD compatible" gets Intel behavior. The Intel code will be cleaned up in the future. Cc: [email protected] Signed-off-by: Sean Christopherson <[email protected]> Message-ID: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2024-04-11Merge tag 'v6.9-rc3' into x86/boot, to pick up fixes before queueing up more ↵Ingo Molnar18-107/+173
changes Signed-off-by: Ingo Molnar <[email protected]>
2024-04-10locking/atomic/x86: Define arch_atomic_sub() family using arch_atomic_add() ↵Uros Bizjak2-20/+4
functions There is no need to implement arch_atomic_sub() family of inline functions, corresponding macros can be directly implemented using arch_atomic_add() inlines with negated argument. No functional changes intended. Signed-off-by: Uros Bizjak <[email protected]> Signed-off-by: Ingo Molnar <[email protected]> Cc: Linus Torvalds <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2024-04-10locking/atomic/x86: Rewrite x86_32 arch_atomic64_{,fetch}_{and,or,xor}() ↵Uros Bizjak1-25/+18
functions Rewrite x86_32 arch_atomic64_{,fetch}_{and,or,xor}() functions to use arch_atomic64_try_cmpxchg(). This implementation avoids one extra trip through the CMPXCHG loop. The value preload before the cmpxchg loop does not need to be atomic. Use arch_atomic64_read_nonatomic(v) to load the value from atomic_t location in a non-atomic way. The generated code improves from: 1917d5: 31 c9 xor %ecx,%ecx 1917d7: 31 db xor %ebx,%ebx 1917d9: 89 4c 24 3c mov %ecx,0x3c(%esp) 1917dd: 8b 74 24 24 mov 0x24(%esp),%esi 1917e1: 89 c8 mov %ecx,%eax 1917e3: 89 5c 24 34 mov %ebx,0x34(%esp) 1917e7: 8b 7c 24 28 mov 0x28(%esp),%edi 1917eb: 21 ce and %ecx,%esi 1917ed: 89 74 24 4c mov %esi,0x4c(%esp) 1917f1: 21 df and %ebx,%edi 1917f3: 89 de mov %ebx,%esi 1917f5: 89 7c 24 50 mov %edi,0x50(%esp) 1917f9: 8b 54 24 4c mov 0x4c(%esp),%edx 1917fd: 8b 7c 24 2c mov 0x2c(%esp),%edi 191801: 8b 4c 24 50 mov 0x50(%esp),%ecx 191805: 89 d3 mov %edx,%ebx 191807: 89 f2 mov %esi,%edx 191809: f0 0f c7 0f lock cmpxchg8b (%edi) 19180d: 89 c1 mov %eax,%ecx 19180f: 8b 74 24 34 mov 0x34(%esp),%esi 191813: 89 d3 mov %edx,%ebx 191815: 89 44 24 4c mov %eax,0x4c(%esp) 191819: 8b 44 24 3c mov 0x3c(%esp),%eax 19181d: 89 df mov %ebx,%edi 19181f: 89 54 24 44 mov %edx,0x44(%esp) 191823: 89 ca mov %ecx,%edx 191825: 31 de xor %ebx,%esi 191827: 31 c8 xor %ecx,%eax 191829: 09 f0 or %esi,%eax 19182b: 75 ac jne 1917d9 <...> to: 1912ba: 8b 06 mov (%esi),%eax 1912bc: 8b 56 04 mov 0x4(%esi),%edx 1912bf: 89 44 24 3c mov %eax,0x3c(%esp) 1912c3: 89 c1 mov %eax,%ecx 1912c5: 23 4c 24 34 and 0x34(%esp),%ecx 1912c9: 89 d3 mov %edx,%ebx 1912cb: 23 5c 24 38 and 0x38(%esp),%ebx 1912cf: 89 54 24 40 mov %edx,0x40(%esp) 1912d3: 89 4c 24 2c mov %ecx,0x2c(%esp) 1912d7: 89 5c 24 30 mov %ebx,0x30(%esp) 1912db: 8b 5c 24 2c mov 0x2c(%esp),%ebx 1912df: 8b 4c 24 30 mov 0x30(%esp),%ecx 1912e3: f0 0f c7 0e lock cmpxchg8b (%esi) 1912e7: 0f 85 f3 02 00 00 jne 1915e0 <...> Signed-off-by: Uros Bizjak <[email protected]> Signed-off-by: Ingo Molnar <[email protected]> Cc: Linus Torvalds <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2024-04-10locking/atomic/x86: Introduce arch_atomic64_read_nonatomic() to x86_32Uros Bizjak1-0/+26
Introduce arch_atomic64_read_nonatomic() for 32-bit targets to load the value from atomic64_t location in a non-atomic way. This function is intended to be used in cases where a subsequent atomic operation will handle the torn value, and can be used to prime the first iteration of unconditional try_cmpxchg() loops. Suggested-by: Mark Rutland <[email protected]> Signed-off-by: Uros Bizjak <[email protected]> Signed-off-by: Ingo Molnar <[email protected]> Cc: Linus Torvalds <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2024-04-10locking/atomic/x86: Introduce arch_atomic64_try_cmpxchg() to x86_32Uros Bizjak1-2/+8
Introduce arch_atomic64_try_cmpxchg() for 32-bit targets to use optimized target specific implementation instead of a generic one. This implementation eliminates dual-word compare after cmpxchg8b instruction and improves generated asm code from: 2273: f0 0f c7 0f lock cmpxchg8b (%edi) 2277: 8b 74 24 2c mov 0x2c(%esp),%esi 227b: 89 d3 mov %edx,%ebx 227d: 89 c2 mov %eax,%edx 227f: 89 5c 24 10 mov %ebx,0x10(%esp) 2283: 8b 7c 24 30 mov 0x30(%esp),%edi 2287: 89 44 24 1c mov %eax,0x1c(%esp) 228b: 31 f2 xor %esi,%edx 228d: 89 d0 mov %edx,%eax 228f: 89 da mov %ebx,%edx 2291: 31 fa xor %edi,%edx 2293: 09 d0 or %edx,%eax 2295: 0f 85 a5 00 00 00 jne 2340 <...> to: 2270: f0 0f c7 0f lock cmpxchg8b (%edi) 2274: 0f 85 a6 00 00 00 jne 2320 <...> Signed-off-by: Uros Bizjak <[email protected]> Signed-off-by: Ingo Molnar <[email protected]> Cc: Linus Torvalds <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2024-04-10Merge branch 'linus' into x86/urgent, to pick up dependent commitsIngo Molnar4-8/+36
Prepare to fix aspects of the new BHI code. Signed-off-by: Ingo Molnar <[email protected]>
2024-04-09KVM: x86: Move nEPT exit_qualification field from kvm_vcpu_arch to x86_exceptionSean Christopherson1-3/+0
Move the exit_qualification field that is used to track information about in-flight nEPT violations from "struct kvm_vcpu_arch" to "x86_exception", i.e. associate the information with the actual nEPT violation instead of the vCPU. To handle bits that are pulled from vmcs.EXIT_QUALIFICATION, i.e. that are propagated from the "original" EPT violation VM-Exit, simply grab them from the VMCS on-demand when injecting a nEPT Violation or a PML Full VM-exit. Aside from being ugly, having an exit_qualification field in kvm_vcpu_arch is outright dangerous, e.g. see commit d7f0a00e438d ("KVM: VMX: Report up-to-date exit qualification to userspace"). Opportunstically add a comment to call out that PML Full and EPT Violation VM-Exits use the same bit to report NMI blocking information. Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Sean Christopherson <[email protected]>
2024-04-09x86/alternatives: Use a temporary buffer when optimizing NOPsBorislav Petkov (AMD)1-1/+1
Instead of optimizing NOPs in-place, use a temporary buffer like the usual alternatives patching flow does. This obviates the need to grab locks when patching, see 6778977590da ("x86/alternatives: Disable interrupts and sync when optimizing NOPs in place") While at it, add nomenclature definitions clarifying and simplifying the naming of function-local variables in the alternatives code. Signed-off-by: Borislav Petkov (AMD) <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2024-04-09x86/alternatives: Catch late X86_FEATURE modifiersBorislav Petkov (AMD)1-2/+6
After alternatives have been patched, changes to the X86_FEATURE flags won't take effect and could potentially even be wrong. Warn about it. This is something which has been long overdue. Signed-off-by: Borislav Petkov (AMD) <[email protected]> Signed-off-by: Ingo Molnar <[email protected]> Tested-by: Srikanth Aithal <[email protected]> Link: https://lore.kernel.org/r/[email protected]