aboutsummaryrefslogtreecommitdiff
AgeCommit message (Collapse)AuthorFilesLines
2022-01-26pid: Introduce helper task_is_in_init_pid_ns()Leo Yan1-0/+5
Currently the kernel uses open code in multiple places to check if a task is in the root PID namespace with the kind of format: if (task_active_pid_ns(current) == &init_pid_ns) do_something(); This patch creates a new helper function, task_is_in_init_pid_ns(), it returns true if a passed task is in the root PID namespace, otherwise returns false. So it will be used to replace open codes. Suggested-by: Suzuki K Poulose <[email protected]> Signed-off-by: Leo Yan <[email protected]> Reviewed-by: Leon Romanovsky <[email protected]> Acked-by: Suzuki K Poulose <[email protected]> Acked-by: Balbir Singh <[email protected]> Signed-off-by: Jakub Kicinski <[email protected]>
2022-01-26gve: Fix GFP flags when allocing pagesCatherine Sullivan4-6/+7
Use GFP_ATOMIC when allocating pages out of the hotpath, continue to use GFP_KERNEL when allocating pages during setup. GFP_KERNEL will allow blocking which allows it to succeed more often in a low memory enviornment but in the hotpath we do not want to allow the allocation to block. Fixes: f5cedc84a30d2 ("gve: Add transmit and receive support") Signed-off-by: Catherine Sullivan <[email protected]> Signed-off-by: David Awogbemila <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jakub Kicinski <[email protected]>
2022-01-27ata: pata_platform: Fix a NULL pointer dereference in __pata_platform_probe()Zhou Qingyang1-0/+2
In __pata_platform_probe(), devm_kzalloc() is assigned to ap->ops and there is a dereference of it right after that, which could introduce a NULL pointer dereference bug. Fix this by adding a NULL check of ap->ops. This bug was found by a static analyzer. Builds with 'make allyesconfig' show no new warnings, and our static analyzer no longer warns about this code. Fixes: f3d5e4f18dba ("ata: pata_of_platform: Allow to use 16-bit wide data transfer") Signed-off-by: Zhou Qingyang <[email protected]> Signed-off-by: Damien Le Moal <[email protected]> Reviewed-by: Sergey Shtylyov <[email protected]>
2022-01-26ucount: Make get_ucount a safe get_user replacementEric W. Biederman1-0/+2
When the ucount code was refactored to create get_ucount it was missed that some of the contexts in which a rlimit is kept elevated can be the only reference to the user/ucount in the system. Ordinary ucount references exist in places that also have a reference to the user namspace, but in POSIX message queues, the SysV shm code, and the SIGPENDING code there is no independent user namespace reference. Inspection of the the user_namespace show no instance of circular references between struct ucounts and the user_namespace. So hold a reference from struct ucount to i's user_namespace to resolve this problem. Link: https://lore.kernel.org/lkml/[email protected]/ Reported-by: Qian Cai <[email protected]> Reported-by: Mathias Krause <[email protected]> Tested-by: Mathias Krause <[email protected]> Reviewed-by: Mathias Krause <[email protected]> Reviewed-by: Alexey Gladkov <[email protected]> Fixes: d64696905554 ("Reimplement RLIMIT_SIGPENDING on top of ucounts") Fixes: 6e52a9f0532f ("Reimplement RLIMIT_MSGQUEUE on top of ucounts") Fixes: d7c9e99aee48 ("Reimplement RLIMIT_MEMLOCK on top of ucounts") Cc: [email protected] Signed-off-by: "Eric W. Biederman" <[email protected]>
2022-01-27selftests: nft_concat_range: add test for reload with no element add/delFlorian Westphal1-1/+71
Add a specific test for the reload issue fixed with commit 23c54263efd7cb ("netfilter: nft_set_pipapo: allocate pcpu scratch maps on clone"). Add to set, then flush set content + restore without other add/remove in the transaction. On kernels before the fix, this test case fails: net,mac with reload [FAIL] Signed-off-by: Florian Westphal <[email protected]> Reviewed-by: Stefano Brivio <[email protected]> Signed-off-by: Pablo Neira Ayuso <[email protected]>
2022-01-27netfilter: nft_byteorder: track register operationsPablo Neira Ayuso1-0/+12
Cancel tracking for byteorder operation, otherwise selector + byteorder operation is incorrectly reduced if source and destination registers are the same. Reported-by: kernel test robot <[email protected]> Signed-off-by: Pablo Neira Ayuso <[email protected]>
2022-01-27netfilter: nft_reject_bridge: Fix for missing reply from preroutingPhil Sutter1-4/+4
Prior to commit fa538f7cf05aa ("netfilter: nf_reject: add reject skbuff creation helpers"), nft_reject_bridge did not assign to nskb->dev before passing nskb on to br_forward(). The shared skbuff creation helpers introduced in above commit do which seems to confuse br_forward() as reject statements in prerouting hook won't emit a packet anymore. Fix this by simply passing NULL instead of 'dev' to the helpers - they use the pointer for just that assignment, nothing else. Fixes: fa538f7cf05aa ("netfilter: nf_reject: add reject skbuff creation helpers") Signed-off-by: Phil Sutter <[email protected]> Signed-off-by: Pablo Neira Ayuso <[email protected]>
2022-01-27selftests: netfilter: check stateless nat udp checksum fixupFlorian Westphal1-0/+152
Add a test that sends large udp packet (which is fragmented) via a stateless nft nat rule, i.e. 'ip saddr set 10.2.3.4' and check that the datagram is received by peer. On kernels without commit 4e1860a38637 ("netfilter: nft_payload: do not update layer 4 checksum when mangling fragments")', this will fail with: cmp: EOF on /tmp/tmp.V1q0iXJyQF which is empty -rw------- 1 root root 4096 Jan 24 22:03 /tmp/tmp.Aaqnq4rBKS -rw------- 1 root root 0 Jan 24 22:03 /tmp/tmp.V1q0iXJyQF ERROR: in and output file mismatch when checking udp with stateless nat FAIL: nftables v1.0.0 (Fearless Fosdick #2) On patched kernels, this will show: PASS: IP statless for ns2-PFp89amx Signed-off-by: Florian Westphal <[email protected]> Signed-off-by: Pablo Neira Ayuso <[email protected]>
2022-01-27selftests: netfilter: reduce zone stress test running timeFlorian Westphal1-6/+6
This selftests needs almost 3 minutes to complete, reduce the insertes zones to 1000. Test now completes in about 20 seconds. Signed-off-by: Florian Westphal <[email protected]> Signed-off-by: Pablo Neira Ayuso <[email protected]>
2022-01-27netfilter: nft_ct: fix use after free when attaching zone templateFlorian Westphal1-1/+4
The conversion erroneously removed the refcount increment. In case we can use the percpu template, we need to increment the refcount, else it will be released when the skb gets freed. In case the slowpath is taken, the new template already has a refcount of 1. Fixes: 719774377622 ("netfilter: conntrack: convert to refcount_t api") Reported-by: kernel test robot <[email protected]> Signed-off-by: Florian Westphal <[email protected]> Signed-off-by: Pablo Neira Ayuso <[email protected]>
2022-01-27netfilter: Remove flowtable relicsGeert Uytterhoeven4-11/+0
NF_FLOW_TABLE_IPV4 and NF_FLOW_TABLE_IPV6 are invisble, selected by nothing (so they can no longer be enabled), and their last real users have been removed (nf_flow_table_ipv6.c is empty). Clean up the leftovers. Fixes: c42ba4290b2147aa ("netfilter: flowtable: remove ipv4/ipv6 modules") Signed-off-by: Geert Uytterhoeven <[email protected]> Signed-off-by: Pablo Neira Ayuso <[email protected]>
2022-01-26rcu-tasks: Fix computation of CPU-to-list shift countsPaul E. McKenney1-4/+8
The ->percpu_enqueue_shift field is used to map from the running CPU number to the index of the corresponding callback list. This mapping can change at runtime in response to varying callback load, resulting in varying levels of contention on the callback-list locks. Unfortunately, the initial value of this field is correct only if the system happens to have a power-of-two number of CPUs, otherwise the callbacks from the high-numbered CPUs can be placed into the callback list indexed by 1 (rather than 0), and those index-1 callbacks will be ignored. This can result in soft lockups and hangs. This commit therefore corrects this mapping, adding one to this shift count as needed for systems having odd numbers of CPUs. Fixes: 7a30871b6a27 ("rcu-tasks: Introduce ->percpu_enqueue_shift for dynamic queue selection") Reported-by: Andrii Nakryiko <[email protected]> Cc: Reported-by: Martin Lau <[email protected]> Cc: Neeraj Upadhyay <[email protected]> Signed-off-by: Paul E. McKenney <[email protected]>
2022-01-26ceph: set pool_ns in new inode layout for async createsJeff Layton1-0/+7
Dan reported that he was unable to write to files that had been asynchronously created when the client's OSD caps are restricted to a particular namespace. The issue is that the layout for the new inode is only partially being filled. Ensure that we populate the pool_ns_data and pool_ns_len in the iinfo before calling ceph_fill_inode. Cc: [email protected] URL: https://tracker.ceph.com/issues/54013 Fixes: 9a8d03ca2e2c ("ceph: attempt to do async create when possible") Reported-by: Dan van der Ster <[email protected]> Signed-off-by: Jeff Layton <[email protected]> Reviewed-by: Ilya Dryomov <[email protected]> Signed-off-by: Ilya Dryomov <[email protected]>
2022-01-26ceph: properly put ceph_string reference after async create attemptJeff Layton1-0/+2
The reference acquired by try_prep_async_create is currently leaked. Ensure we put it. Cc: [email protected] Fixes: 9a8d03ca2e2c ("ceph: attempt to do async create when possible") Signed-off-by: Jeff Layton <[email protected]> Reviewed-by: Ilya Dryomov <[email protected]> Signed-off-by: Ilya Dryomov <[email protected]>
2022-01-26ceph: put the requests/sessions when it fails to alloc memoryXiubo Li1-18/+37
When failing to allocate the sessions memory we should make sure the req1 and req2 and the sessions get put. And also in case the max_sessions decreased so when kreallocate the new memory some sessions maybe missed being put. And if the max_sessions is 0 krealloc will return ZERO_SIZE_PTR, which will lead to a distinct access fault. URL: https://tracker.ceph.com/issues/53819 Fixes: e1a4541ec0b9 ("ceph: flush the mdlog before waiting on unsafe reqs") Signed-off-by: Xiubo Li <[email protected]> Reviewed-by: Venky Shankar <[email protected]> Reviewed-by: Jeff Layton <[email protected]> Signed-off-by: Ilya Dryomov <[email protected]>
2022-01-26arm64: extable: fix load_unaligned_zeropad() reg indicesEvgenii Stepanov1-2/+2
In ex_handler_load_unaligned_zeropad() we erroneously extract the data and addr register indices from ex->type rather than ex->data. As ex->type will contain EX_TYPE_LOAD_UNALIGNED_ZEROPAD (i.e. 4): * We'll always treat X0 as the address register, since EX_DATA_REG_ADDR is extracted from bits [9:5]. Thus, we may attempt to dereference an arbitrary address as X0 may hold an arbitrary value. * We'll always treat X4 as the data register, since EX_DATA_REG_DATA is extracted from bits [4:0]. Thus we will corrupt X4 and cause arbitrary behaviour within load_unaligned_zeropad() and its caller. Fix this by extracting both values from ex->data as originally intended. On an MTE-enabled QEMU image we are hitting the following crash: Unable to handle kernel NULL pointer dereference at virtual address 0000000000000000 Call trace: fixup_exception+0xc4/0x108 __do_kernel_fault+0x3c/0x268 do_tag_check_fault+0x3c/0x104 do_mem_abort+0x44/0xf4 el1_abort+0x40/0x64 el1h_64_sync_handler+0x60/0xa0 el1h_64_sync+0x7c/0x80 link_path_walk+0x150/0x344 path_openat+0xa0/0x7dc do_filp_open+0xb8/0x168 do_sys_openat2+0x88/0x17c __arm64_sys_openat+0x74/0xa0 invoke_syscall+0x48/0x148 el0_svc_common+0xb8/0xf8 do_el0_svc+0x28/0x88 el0_svc+0x24/0x84 el0t_64_sync_handler+0x88/0xec el0t_64_sync+0x1b4/0x1b8 Code: f8695a69 71007d1f 540000e0 927df12a (f940014a) Fixes: 753b32368705 ("arm64: extable: add load_unaligned_zeropad() handler") Cc: <[email protected]> # 5.16.x Reviewed-by: Mark Rutland <[email protected]> Signed-off-by: Evgenii Stepanov <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Catalin Marinas <[email protected]>
2022-01-26counter: fix an IS_ERR() vs NULL bugDan Carpenter1-9/+6
There are 8 callers for devm_counter_alloc() and they all check for NULL instead of error pointers. I think NULL is the better thing to return for allocation functions so update counter_alloc() and devm_counter_alloc() to return NULL instead of error pointers. Fixes: c18e2760308e ("counter: Provide alternative counter registration functions") Acked-by: Uwe Kleine-König <[email protected]> Acked-by: William Breathitt Gray <[email protected]> Signed-off-by: Dan Carpenter <[email protected]> Link: https://lore.kernel.org/r/20220111173243.GA2192@kili Signed-off-by: Greg Kroah-Hartman <[email protected]>
2022-01-26selftests: kvm: move vm_xsave_req_perm call to amx_testPaolo Bonzini5-14/+9
There is no need for tests other than amx_test to enable dynamic xsave states. Remove the call to vm_xsave_req_perm from generic code, and move it inside the test. While at it, allow customizing the bit that is requested, so that future tests can use it differently. Signed-off-by: Paolo Bonzini <[email protected]>
2022-01-26KVM: x86: Sync the states size with the XCR0/IA32_XSS at, any timeLike Xu1-2/+2
XCR0 is reset to 1 by RESET but not INIT and IA32_XSS is zeroed by both RESET and INIT. The kvm_set_msr_common()'s handling of MSR_IA32_XSS also needs to update kvm_update_cpuid_runtime(). In the above cases, the size in bytes of the XSAVE area containing all states enabled by XCR0 or (XCRO | IA32_XSS) needs to be updated. For simplicity and consistency, existing helpers are used to write values and call kvm_update_cpuid_runtime(), and it's not exactly a fast path. Fixes: a554d207dc46 ("KVM: X86: Processor States following Reset or INIT") Cc: [email protected] Signed-off-by: Like Xu <[email protected]> Signed-off-by: Sean Christopherson <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2022-01-26KVM: x86: Update vCPU's runtime CPUID on write to MSR_IA32_XSSLike Xu1-0/+1
Do a runtime CPUID update for a vCPU if MSR_IA32_XSS is written, as the size in bytes of the XSAVE area is affected by the states enabled in XSS. Fixes: 203000993de5 ("kvm: vmx: add MSR logic for XSAVES") Cc: [email protected] Signed-off-by: Like Xu <[email protected]> [sean: split out as a separate patch, adjust Fixes tag] Signed-off-by: Sean Christopherson <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2022-01-26KVM: x86: Keep MSR_IA32_XSS unchanged for INITXiaoyao Li1-2/+1
It has been corrected from SDM version 075 that MSR_IA32_XSS is reset to zero on Power up and Reset but keeps unchanged on INIT. Fixes: a554d207dc46 ("KVM: X86: Processor States following Reset or INIT") Cc: [email protected] Signed-off-by: Xiaoyao Li <[email protected]> Signed-off-by: Sean Christopherson <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2022-01-26s390/hypfs: include z/VM guests with access control group setVasily Gorbik1-2/+4
Currently if z/VM guest is allowed to retrieve hypervisor performance data globally for all guests (privilege class B) the query is formed in a way to include all guests but the group name is left empty. This leads to that z/VM guests which have access control group set not being included in the results (even local vm). Change the query group identifier from empty to "any" to retrieve information about all guests from any groups (or without a group set). Cc: [email protected] Fixes: 31cb4bd31a48 ("[S390] Hypervisor filesystem (s390_hypfs) for z/VM") Reviewed-by: Gerald Schaefer <[email protected]> Signed-off-by: Vasily Gorbik <[email protected]>
2022-01-26KVM: x86: Free kvm_cpuid_entry2 array on post-KVM_RUN KVM_SET_CPUID{,2}Sean Christopherson1-2/+8
Free the "struct kvm_cpuid_entry2" array on successful post-KVM_RUN KVM_SET_CPUID{,2} to fix a memory leak, the callers of kvm_set_cpuid() free the array only on failure. BUG: memory leak unreferenced object 0xffff88810963a800 (size 2048): comm "syz-executor025", pid 3610, jiffies 4294944928 (age 8.080s) hex dump (first 32 bytes): 00 00 00 00 00 00 00 00 00 00 00 00 0d 00 00 00 ................ 47 65 6e 75 6e 74 65 6c 69 6e 65 49 00 00 00 00 GenuntelineI.... backtrace: [<ffffffff814948ee>] kmalloc_node include/linux/slab.h:604 [inline] [<ffffffff814948ee>] kvmalloc_node+0x3e/0x100 mm/util.c:580 [<ffffffff814950f2>] kvmalloc include/linux/slab.h:732 [inline] [<ffffffff814950f2>] vmemdup_user+0x22/0x100 mm/util.c:199 [<ffffffff8109f5ff>] kvm_vcpu_ioctl_set_cpuid2+0x8f/0xf0 arch/x86/kvm/cpuid.c:423 [<ffffffff810711b9>] kvm_arch_vcpu_ioctl+0xb99/0x1e60 arch/x86/kvm/x86.c:5251 [<ffffffff8103e92d>] kvm_vcpu_ioctl+0x4ad/0x950 arch/x86/kvm/../../../virt/kvm/kvm_main.c:4066 [<ffffffff815afacc>] vfs_ioctl fs/ioctl.c:51 [inline] [<ffffffff815afacc>] __do_sys_ioctl fs/ioctl.c:874 [inline] [<ffffffff815afacc>] __se_sys_ioctl fs/ioctl.c:860 [inline] [<ffffffff815afacc>] __x64_sys_ioctl+0xfc/0x140 fs/ioctl.c:860 [<ffffffff844a3335>] do_syscall_x64 arch/x86/entry/common.c:50 [inline] [<ffffffff844a3335>] do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80 [<ffffffff84600068>] entry_SYSCALL_64_after_hwframe+0x44/0xae Fixes: c6617c61e8fe ("KVM: x86: Partially allow KVM_SET_CPUID{,2} after KVM_RUN") Cc: [email protected] Reported-by: [email protected] Signed-off-by: Sean Christopherson <[email protected]> Message-Id: <[email protected]> Reviewed-by: Vitaly Kuznetsov <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2022-01-26xfs, iomap: limit individual ioend chain lengths in writebackDave Chinner3-5/+65
Trond Myklebust reported soft lockups in XFS IO completion such as this: watchdog: BUG: soft lockup - CPU#12 stuck for 23s! [kworker/12:1:3106] CPU: 12 PID: 3106 Comm: kworker/12:1 Not tainted 4.18.0-305.10.2.el8_4.x86_64 #1 Workqueue: xfs-conv/md127 xfs_end_io [xfs] RIP: 0010:_raw_spin_unlock_irqrestore+0x11/0x20 Call Trace: wake_up_page_bit+0x8a/0x110 iomap_finish_ioend+0xd7/0x1c0 iomap_finish_ioends+0x7f/0xb0 xfs_end_ioend+0x6b/0x100 [xfs] xfs_end_io+0xb9/0xe0 [xfs] process_one_work+0x1a7/0x360 worker_thread+0x1fa/0x390 kthread+0x116/0x130 ret_from_fork+0x35/0x40 Ioends are processed as an atomic completion unit when all the chained bios in the ioend have completed their IO. Logically contiguous ioends can also be merged and completed as a single, larger unit. Both of these things can be problematic as both the bio chains per ioend and the size of the merged ioends processed as a single completion are both unbound. If we have a large sequential dirty region in the page cache, write_cache_pages() will keep feeding us sequential pages and we will keep mapping them into ioends and bios until we get a dirty page at a non-sequential file offset. These large sequential runs can will result in bio and ioend chaining to optimise the io patterns. The pages iunder writeback are pinned within these chains until the submission chaining is broken, allowing the entire chain to be completed. This can result in huge chains being processed in IO completion context. We get deep bio chaining if we have large contiguous physical extents. We will keep adding pages to the current bio until it is full, then we'll chain a new bio to keep adding pages for writeback. Hence we can build bio chains that map millions of pages and tens of gigabytes of RAM if the page cache contains big enough contiguous dirty file regions. This long bio chain pins those pages until the final bio in the chain completes and the ioend can iterate all the chained bios and complete them. OTOH, if we have a physically fragmented file, we end up submitting one ioend per physical fragment that each have a small bio or bio chain attached to them. We do not chain these at IO submission time, but instead we chain them at completion time based on file offset via iomap_ioend_try_merge(). Hence we can end up with unbound ioend chains being built via completion merging. XFS can then do COW remapping or unwritten extent conversion on that merged chain, which involves walking an extent fragment at a time and running a transaction to modify the physical extent information. IOWs, we merge all the discontiguous ioends together into a contiguous file range, only to then process them individually as discontiguous extents. This extent manipulation is computationally expensive and can run in a tight loop, so merging logically contiguous but physically discontigous ioends gains us nothing except for hiding the fact the fact we broke the ioends up into individual physical extents at submission and then need to loop over those individual physical extents at completion. Hence we need to have mechanisms to limit ioend sizes and to break up completion processing of large merged ioend chains: 1. bio chains per ioend need to be bound in length. Pure overwrites go straight to iomap_finish_ioend() in softirq context with the exact bio chain attached to the ioend by submission. Hence the only way to prevent long holdoffs here is to bound ioend submission sizes because we can't reschedule in softirq context. 2. iomap_finish_ioends() has to handle unbound merged ioend chains correctly. This relies on any one call to iomap_finish_ioend() being bound in runtime so that cond_resched() can be issued regularly as the long ioend chain is processed. i.e. this relies on mechanism #1 to limit individual ioend sizes to work correctly. 3. filesystems have to loop over the merged ioends to process physical extent manipulations. This means they can loop internally, and so we break merging at physical extent boundaries so the filesystem can easily insert reschedule points between individual extent manipulations. Signed-off-by: Dave Chinner <[email protected]> Reported-and-tested-by: Trond Myklebust <[email protected]> Reviewed-by: Darrick J. Wong <[email protected]> Signed-off-by: Darrick J. Wong <[email protected]>
2022-01-26KVM: nVMX: WARN on any attempt to allocate shadow VMCS for vmcs02Sean Christopherson1-10/+12
WARN if KVM attempts to allocate a shadow VMCS for vmcs02. KVM emulates VMCS shadowing but doesn't virtualize it, i.e. KVM should never allocate a "real" shadow VMCS for L2. The previous code WARNed but continued anyway with the allocation, presumably in an attempt to avoid NULL pointer dereference. However, alloc_vmcs (and hence alloc_shadow_vmcs) can fail, and indeed the sole caller does: if (enable_shadow_vmcs && !alloc_shadow_vmcs(vcpu)) goto out_shadow_vmcs; which makes it not a useful attempt. Signed-off-by: Sean Christopherson <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2022-01-26KVM: selftests: Don't skip L2's VMCALL in SMM test for SVM guestSean Christopherson1-1/+0
Don't skip the vmcall() in l2_guest_code() prior to re-entering L2, doing so will result in L2 running to completion, popping '0' off the stack for RET, jumping to address '0', and ultimately dying with a triple fault shutdown. It's not at all obvious why the test re-enters L2 and re-executes VMCALL, but presumably it serves a purpose. The VMX path doesn't skip vmcall(), and the test can't possibly have passed on SVM, so just do what VMX does. Fixes: d951b2210c1a ("KVM: selftests: smm_test: Test SMM enter from L2") Cc: Maxim Levitsky <[email protected]> Signed-off-by: Sean Christopherson <[email protected]> Message-Id: <[email protected]> Reviewed-by: Vitaly Kuznetsov <[email protected]> Tested-by: Vitaly Kuznetsov <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2022-01-26KVM: x86: Check .flags in kvm_cpuid_check_equal() tooVitaly Kuznetsov1-0/+1
kvm_cpuid_check_equal() checks for the (full) equality of the supplied CPUID data so .flags need to be checked too. Reported-by: Sean Christopherson <[email protected]> Fixes: c6617c61e8fe ("KVM: x86: Partially allow KVM_SET_CPUID{,2} after KVM_RUN") Signed-off-by: Vitaly Kuznetsov <[email protected]> Message-Id: <[email protected]> Cc: [email protected] Signed-off-by: Paolo Bonzini <[email protected]>
2022-01-26KVM: x86: Forcibly leave nested virt when SMM state is toggledSean Christopherson6-7/+12
Forcibly leave nested virtualization operation if userspace toggles SMM state via KVM_SET_VCPU_EVENTS or KVM_SYNC_X86_EVENTS. If userspace forces the vCPU out of SMM while it's post-VMXON and then injects an SMI, vmx_enter_smm() will overwrite vmx->nested.smm.vmxon and end up with both vmxon=false and smm.vmxon=false, but all other nVMX state allocated. Don't attempt to gracefully handle the transition as (a) most transitions are nonsencial, e.g. forcing SMM while L2 is running, (b) there isn't sufficient information to handle all transitions, e.g. SVM wants access to the SMRAM save state, and (c) KVM_SET_VCPU_EVENTS must precede KVM_SET_NESTED_STATE during state restore as the latter disallows putting the vCPU into L2 if SMM is active, and disallows tagging the vCPU as being post-VMXON in SMM if SMM is not active. Abuse of KVM_SET_VCPU_EVENTS manifests as a WARN and memory leak in nVMX due to failure to free vmcs01's shadow VMCS, but the bug goes far beyond just a memory leak, e.g. toggling SMM on while L2 is active puts the vCPU in an architecturally impossible state. WARNING: CPU: 0 PID: 3606 at free_loaded_vmcs arch/x86/kvm/vmx/vmx.c:2665 [inline] WARNING: CPU: 0 PID: 3606 at free_loaded_vmcs+0x158/0x1a0 arch/x86/kvm/vmx/vmx.c:2656 Modules linked in: CPU: 1 PID: 3606 Comm: syz-executor725 Not tainted 5.17.0-rc1-syzkaller #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 RIP: 0010:free_loaded_vmcs arch/x86/kvm/vmx/vmx.c:2665 [inline] RIP: 0010:free_loaded_vmcs+0x158/0x1a0 arch/x86/kvm/vmx/vmx.c:2656 Code: <0f> 0b eb b3 e8 8f 4d 9f 00 e9 f7 fe ff ff 48 89 df e8 92 4d 9f 00 Call Trace: <TASK> kvm_arch_vcpu_destroy+0x72/0x2f0 arch/x86/kvm/x86.c:11123 kvm_vcpu_destroy arch/x86/kvm/../../../virt/kvm/kvm_main.c:441 [inline] kvm_destroy_vcpus+0x11f/0x290 arch/x86/kvm/../../../virt/kvm/kvm_main.c:460 kvm_free_vcpus arch/x86/kvm/x86.c:11564 [inline] kvm_arch_destroy_vm+0x2e8/0x470 arch/x86/kvm/x86.c:11676 kvm_destroy_vm arch/x86/kvm/../../../virt/kvm/kvm_main.c:1217 [inline] kvm_put_kvm+0x4fa/0xb00 arch/x86/kvm/../../../virt/kvm/kvm_main.c:1250 kvm_vm_release+0x3f/0x50 arch/x86/kvm/../../../virt/kvm/kvm_main.c:1273 __fput+0x286/0x9f0 fs/file_table.c:311 task_work_run+0xdd/0x1a0 kernel/task_work.c:164 exit_task_work include/linux/task_work.h:32 [inline] do_exit+0xb29/0x2a30 kernel/exit.c:806 do_group_exit+0xd2/0x2f0 kernel/exit.c:935 get_signal+0x4b0/0x28c0 kernel/signal.c:2862 arch_do_signal_or_restart+0x2a9/0x1c40 arch/x86/kernel/signal.c:868 handle_signal_work kernel/entry/common.c:148 [inline] exit_to_user_mode_loop kernel/entry/common.c:172 [inline] exit_to_user_mode_prepare+0x17d/0x290 kernel/entry/common.c:207 __syscall_exit_to_user_mode_work kernel/entry/common.c:289 [inline] syscall_exit_to_user_mode+0x19/0x60 kernel/entry/common.c:300 do_syscall_64+0x42/0xb0 arch/x86/entry/common.c:86 entry_SYSCALL_64_after_hwframe+0x44/0xae </TASK> Cc: [email protected] Reported-by: [email protected] Signed-off-by: Sean Christopherson <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2022-01-26KVM: SVM: drop unnecessary code in svm_hv_vmcb_dirty_nested_enlightenments()Vitaly Kuznetsov2-13/+1
Commit 3fa5e8fd0a0e4 ("KVM: SVM: delay svm_vcpu_init_msrpm after svm->vmcb is initialized") re-arranged svm_vcpu_init_msrpm() call in svm_create_vcpu(), thus making the comment about vmcb being NULL obsolete. Drop it. While on it, drop superfluous vmcb_is_clean() check: vmcb_mark_dirty() is a bit flip, an extra check is unlikely to bring any performance gain. Drop now-unneeded vmcb_is_clean() helper as well. Fixes: 3fa5e8fd0a0e4 ("KVM: SVM: delay svm_vcpu_init_msrpm after svm->vmcb is initialized") Signed-off-by: Vitaly Kuznetsov <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2022-01-26KVM: SVM: hyper-v: Enable Enlightened MSR-Bitmap support for realVitaly Kuznetsov1-0/+3
Commit c4327f15dfc7 ("KVM: SVM: hyper-v: Enlightened MSR-Bitmap support") introduced enlightened MSR-Bitmap support for KVM-on-Hyper-V but it didn't actually enable the support. Similar to enlightened NPT TLB flush and direct TLB flush features, the guest (KVM) has to tell L0 (Hyper-V) that it's using the feature by setting the appropriate feature fit in VMCB control area (sw reserved fields). Fixes: c4327f15dfc7 ("KVM: SVM: hyper-v: Enlightened MSR-Bitmap support") Signed-off-by: Vitaly Kuznetsov <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2022-01-26KVM: SVM: Don't kill SEV guest if SMAP erratum triggers in usermodeSean Christopherson1-1/+15
Inject a #GP instead of synthesizing triple fault to try to avoid killing the guest if emulation of an SEV guest fails due to encountering the SMAP erratum. The injected #GP may still be fatal to the guest, e.g. if the userspace process is providing critical functionality, but KVM should make every attempt to keep the guest alive. Signed-off-by: Sean Christopherson <[email protected]> Reviewed-by: Liam Merwick <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2022-01-26KVM: SVM: Don't apply SEV+SMAP workaround on code fetch or PT accessSean Christopherson1-9/+34
Resume the guest instead of synthesizing a triple fault shutdown if the instruction bytes buffer is empty due to the #NPF being on the code fetch itself or on a page table access. The SMAP errata applies if and only if the code fetch was successful and ucode's subsequent data read from the code page encountered a SMAP violation. In practice, the guest is likely hosed either way, but crashing the guest on a code fetch to emulated MMIO is technically wrong according to the behavior described in the APM. Signed-off-by: Sean Christopherson <[email protected]> Reviewed-by: Liam Merwick <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2022-01-26KVM: SVM: Inject #UD on attempted emulation for SEV guest w/o insn bufferSean Christopherson1-34/+55
Inject #UD if KVM attempts emulation for an SEV guests without an insn buffer and instruction decoding is required. The previous behavior of allowing emulation if there is no insn buffer is undesirable as doing so means KVM is reading guest private memory and thus decoding cyphertext, i.e. is emulating garbage. The check was previously necessary as the emulation type was not provided, i.e. SVM needed to allow emulation to handle completion of emulation after exiting to userspace to handle I/O. Signed-off-by: Sean Christopherson <[email protected]> Reviewed-by: Liam Merwick <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2022-01-26KVM: SVM: WARN if KVM attempts emulation on #UD or #GP for SEV guestsSean Christopherson1-0/+5
WARN if KVM attempts to emulate in response to #UD or #GP for SEV guests, i.e. if KVM intercepts #UD or #GP, as emulation on any fault except #NPF is impossible since KVM cannot read guest private memory to get the code stream, and the CPU's DecodeAssists feature only provides the instruction bytes on #NPF. Signed-off-by: Sean Christopherson <[email protected]> Reviewed-by: Liam Merwick <[email protected]> Message-Id: <[email protected]> [Warn on EMULTYPE_TRAP_UD_FORCED according to Liam Merwick's review. - Paolo] Signed-off-by: Paolo Bonzini <[email protected]>
2022-01-26KVM: x86: Pass emulation type to can_emulate_instruction()Sean Christopherson4-7/+17
Pass the emulation type to kvm_x86_ops.can_emulate_insutrction() so that a future commit can harden KVM's SEV support to WARN on emulation scenarios that should never happen. No functional change intended. Signed-off-by: Sean Christopherson <[email protected]> Reviewed-by: Liam Merwick <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2022-01-26KVM: SVM: Explicitly require DECODEASSISTS to enable SEV supportSean Christopherson1-2/+7
Add a sanity check on DECODEASSIST being support if SEV is supported, as KVM cannot read guest private memory and thus relies on the CPU to provide the instruction byte stream on #NPF for emulation. The intent of the check is to document the dependency, it should never fail in practice as producing hardware that supports SEV but not DECODEASSISTS would be non-sensical. Signed-off-by: Sean Christopherson <[email protected]> Reviewed-by: Liam Merwick <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2022-01-26KVM: SVM: Don't intercept #GP for SEV guestsSean Christopherson1-3/+8
Never intercept #GP for SEV guests as reading SEV guest private memory will return cyphertext, i.e. emulating on #GP can't work as intended. Cc: [email protected] Cc: Tom Lendacky <[email protected]> Cc: Brijesh Singh <[email protected]> Signed-off-by: Sean Christopherson <[email protected]> Reviewed-by: Liam Merwick <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2022-01-26Revert "KVM: SVM: avoid infinite loop on NPF from bad address"Sean Christopherson2-8/+0
Revert a completely broken check on an "invalid" RIP in SVM's workaround for the DecodeAssists SMAP errata. kvm_vcpu_gfn_to_memslot() obviously expects a gfn, i.e. operates in the guest physical address space, whereas RIP is a virtual (not even linear) address. The "fix" worked for the problematic KVM selftest because the test identity mapped RIP. Fully revert the hack instead of trying to translate RIP to a GPA, as the non-SEV case is now handled earlier, and KVM cannot access guest page tables to translate RIP. This reverts commit e72436bc3a5206f95bb384e741154166ddb3202e. Fixes: e72436bc3a52 ("KVM: SVM: avoid infinite loop on NPF from bad address") Reported-by: Liam Merwick <[email protected]> Cc: [email protected] Signed-off-by: Sean Christopherson <[email protected]> Reviewed-by: Liam Merwick <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2022-01-26KVM: SVM: Never reject emulation due to SMAP errata for !SEV guestsSean Christopherson1-4/+6
Always signal that emulation is possible for !SEV guests regardless of whether or not the CPU provided a valid instruction byte stream. KVM can read all guest state (memory and registers) for !SEV guests, i.e. can fetch the code stream from memory even if the CPU failed to do so because of the SMAP errata. Fixes: 05d5a4863525 ("KVM: SVM: Workaround errata#1096 (insn_len maybe zero on SMAP violation)") Cc: [email protected] Cc: Tom Lendacky <[email protected]> Cc: Brijesh Singh <[email protected]> Signed-off-by: Sean Christopherson <[email protected]> Reviewed-by: Liam Merwick <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2022-01-26KVM: x86: nSVM: skip eax alignment check for non-SVM instructionsDenis Valeev1-5/+6
The bug occurs on #GP triggered by VMware backdoor when eax value is unaligned. eax alignment check should not be applied to non-SVM instructions because it leads to incorrect omission of the instructions emulation. Apply the alignment check only to SVM instructions to fix. Fixes: d1cba6c92237 ("KVM: x86: nSVM: test eax for 4K alignment for GP errata workaround") Signed-off-by: Denis Valeev <[email protected]> Message-Id: <Yexlhaoe1Fscm59u@q> Cc: [email protected] Signed-off-by: Paolo Bonzini <[email protected]>
2022-01-26KVM: x86/cpuid: Exclude unpermitted xfeatures sizes at KVM_GET_SUPPORTED_CPUIDLike Xu1-12/+13
With the help of xstate_get_guest_group_perm(), KVM can exclude unpermitted xfeatures in cpuid.0xd.0.eax, in which case the corresponding xfeatures sizes should also be matched to the permitted xfeatures. To fix this inconsistency, the permitted_xcr0 and permitted_xss are defined consistently, which implies 'supported' plus certain permissions for this task, and it also fixes cpuid.0xd.1.ebx and later leaf-by-leaf queries. Fixes: 445ecdf79be0 ("kvm: x86: Exclude unpermitted xfeatures at KVM_GET_SUPPORTED_CPUID") Signed-off-by: Like Xu <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2022-01-26KVM: LAPIC: Also cancel preemption timer during SET_LAPICWanpeng Li1-1/+1
The below warning is splatting during guest reboot. ------------[ cut here ]------------ WARNING: CPU: 0 PID: 1931 at arch/x86/kvm/x86.c:10322 kvm_arch_vcpu_ioctl_run+0x874/0x880 [kvm] CPU: 0 PID: 1931 Comm: qemu-system-x86 Tainted: G I 5.17.0-rc1+ #5 RIP: 0010:kvm_arch_vcpu_ioctl_run+0x874/0x880 [kvm] Call Trace: <TASK> kvm_vcpu_ioctl+0x279/0x710 [kvm] __x64_sys_ioctl+0x83/0xb0 do_syscall_64+0x3b/0xc0 entry_SYSCALL_64_after_hwframe+0x44/0xae RIP: 0033:0x7fd39797350b This can be triggered by not exposing tsc-deadline mode and doing a reboot in the guest. The lapic_shutdown() function which is called in sys_reboot path will not disarm the flying timer, it just masks LVTT. lapic_shutdown() clears APIC state w/ LVT_MASKED and timer-mode bit is 0, this can trigger timer-mode switch between tsc-deadline and oneshot/periodic, which can result in preemption timer be cancelled in apic_update_lvtt(). However, We can't depend on this when not exposing tsc-deadline mode and oneshot/periodic modes emulated by preemption timer. Qemu will synchronise states around reset, let's cancel preemption timer under KVM_SET_LAPIC. Signed-off-by: Wanpeng Li <[email protected]> Message-Id: <[email protected]> Cc: [email protected] Signed-off-by: Paolo Bonzini <[email protected]>
2022-01-26KVM: VMX: Remove vmcs_config.orderJim Mattson2-4/+2
The maximum size of a VMCS (or VMXON region) is 4096. By definition, these are order 0 allocations. Signed-off-by: Jim Mattson <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2022-01-26cpuset: Fix the bug that subpart_cpus updated wrongly in update_cpumask()Tianchen Ding1-2/+1
subparts_cpus should be limited as a subset of cpus_allowed, but it is updated wrongly by using cpumask_andnot(). Use cpumask_and() instead to fix it. Fixes: ee8dde0cd2ce ("cpuset: Add new v2 cpuset.sched.partition flag") Signed-off-by: Tianchen Ding <[email protected]> Reviewed-by: Waiman Long <[email protected]> Signed-off-by: Tejun Heo <[email protected]>
2022-01-26PCI/sysfs: Find shadow ROM before static attribute initializationBjorn Helgaas3-9/+8
Ville reported that the sysfs "rom" file for VGA devices disappeared after 527139d738d7 ("PCI/sysfs: Convert "rom" to static attribute"). Prior to 527139d738d7, FINAL fixups, including pci_fixup_video() where we find shadow ROMs, were run before pci_create_sysfs_dev_files() created the sysfs "rom" file. After 527139d738d7, "rom" is a static attribute and is created before FINAL fixups are run, so we didn't create "rom" files for shadow ROMs: acpi_pci_root_add ... pci_scan_single_device pci_device_add pci_fixup_video # <-- new HEADER fixup device_add ... if (grp->is_visible()) pci_dev_rom_attr_is_visible # after 527139d738d7 pci_bus_add_devices pci_bus_add_device pci_fixup_device(pci_fixup_final) pci_fixup_video # <-- previous FINAL fixup pci_create_sysfs_dev_files if (pci_resource_len(pdev, PCI_ROM_RESOURCE)) sysfs_create_bin_file("rom") # before 527139d738d7 Change pci_fixup_video() to be a HEADER fixup so it runs before sysfs static attributes are initialized. Rename the Loongson pci_fixup_radeon() to pci_fixup_video() and make its dmesg logging identical to the others since it is doing the same job. Link: https://lore.kernel.org/r/[email protected] Fixes: 527139d738d7 ("PCI/sysfs: Convert "rom" to static attribute") Link: https://lore.kernel.org/r/[email protected] Reported-by: Ville Syrjälä <[email protected]> Tested-by: Ville Syrjälä <[email protected]> Signed-off-by: Bjorn Helgaas <[email protected]> Cc: [email protected] # v5.13+ Cc: Huacai Chen <[email protected]> Cc: Jiaxun Yang <[email protected]> Cc: Thomas Bogendoerfer <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Borislav Petkov <[email protected]> Cc: Dave Hansen <[email protected]> Cc: Krzysztof Wilczyński <[email protected]>
2022-01-26spi: uniphier: fix reference count leak in uniphier_spi_probe()Xin Xiong1-4/+14
The issue happens in several error paths in uniphier_spi_probe(). When either dma_get_slave_caps() or devm_spi_register_master() returns an error code, the function forgets to decrease the refcount of both `dma_rx` and `dma_tx` objects, which may lead to refcount leaks. Fix it by decrementing the reference count of specific objects in those error paths. Signed-off-by: Xin Xiong <[email protected]> Signed-off-by: Xiyu Yang <[email protected]> Signed-off-by: Xin Tan <[email protected]> Reviewed-by: Kunihiko Hayashi <[email protected]> Fixes: 28d1dddc59f6 ("spi: uniphier: Add DMA transfer mode support") Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Mark Brown <[email protected]>
2022-01-26Merge branch 'lan966x-fixes'David S. Miller2-8/+9
Horatiu Vultur says: ==================== net: lan966x: Fixes for sleep in atomic context This patch series contains 2 fixes for lan966x that is sleeping in atomic context. The first patch fixes the injection of the frames while the second one fixes the updating of the MAC table. v1->v2: - correct the fix tag in the second patch, it was using the wrong sha. ==================== Signed-off-by: David S. Miller <[email protected]>
2022-01-26net: lan966x: Fix sleep in atomic context when updating MAC tableHoratiu Vultur1-5/+6
The function lan966x_mac_wait_for_completion is used to poll the status of the MAC table using the function readx_poll_timeout. The problem with this function is that is called also from atomic context. Therefore update the function to use readx_poll_timeout_atomic. Fixes: e18aba8941b40b ("net: lan966x: add mactable support") Signed-off-by: Horatiu Vultur <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2022-01-26net: lan966x: Fix sleep in atomic context when injecting framesHoratiu Vultur1-3/+3
On lan966x, when injecting a frame it was polling the register QS_INJ_STATUS to see if it can continue with the injection of the frame. The problem was that it was using readx_poll_timeout which could sleep in atomic context. This patch fixes this issue by using readx_poll_timeout_atomic. Fixes: d28d6d2e37d10d ("net: lan966x: add port module support") Signed-off-by: Horatiu Vultur <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2022-01-26Merge branch 'dev_addr-const-fixes'David S. Miller6-26/+33
Jakub Kicinski says: ==================== ethernet: fix some esoteric drivers after netdev->dev_addr constification Looking at recent fixes for drivers which don't get included with allmodconfig builds I thought it's worth grepping for more instances of: dev->dev_addr\[.*\] = This set contains the fixes. v2: add last 3 patches which fix drivers for the RiscPC ARM platform. Thanks to Arnd Bergmann for explaining how to build test that. ==================== Signed-off-by: David S. Miller <[email protected]>