aboutsummaryrefslogtreecommitdiff
AgeCommit message (Collapse)AuthorFilesLines
2022-04-22Merge tag '5.18-rc3-smb3-fixes' of git://git.samba.org/sfrench/cifs-2.6Linus Torvalds4-10/+31
Pull cifs fixes from Steve French: "Four fixes, two of them for stable: - fcollapse fix - reconnect lock fix - DFS oops fix - minor cleanup patch" * tag '5.18-rc3-smb3-fixes' of git://git.samba.org/sfrench/cifs-2.6: cifs: destage any unwritten data to the server before calling copychunk_write cifs: use correct lock type in cifs_reconnect() cifs: fix NULL ptr dereference in refresh_mounts() cifs: Use kzalloc instead of kmalloc/memset
2022-04-22Merge tag 'fs.fixes.v5.18-rc4' of ↵Linus Torvalds1-1/+13
git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux Pull mount_setattr fix from Christian Brauner: "The recent cleanup in e257039f0fc7 ("mount_setattr(): clean the control flow and calling conventions") switched the mount attribute codepaths from do-while to for loops as they are more idiomatic when walking mounts. However, we did originally choose do-while constructs because if we request a mount or mount tree to be made read-only we need to hold writers in the following way: The mount attribute code will grab lock_mount_hash() and then call mnt_hold_writers() which will _unconditionally_ set MNT_WRITE_HOLD on the mount. Any callers that need write access have to call mnt_want_write(). They will immediately see that MNT_WRITE_HOLD is set on the mount and the caller will then either spin (on non-preempt-rt) or wait on lock_mount_hash() (on preempt-rt). The fact that MNT_WRITE_HOLD is set unconditionally means that once mnt_hold_writers() returns we need to _always_ pair it with mnt_unhold_writers() in both the failure and success paths. The do-while constructs did take care of this. But Al's change to a for loop in the failure path stops on the first mount we failed to change mount attributes _without_ going into the loop to call mnt_unhold_writers(). This in turn means that once we failed to make a mount read-only via mount_setattr() - i.e. there are already writers on that mount - we will block any writers indefinitely. Fix this by ensuring that the for loop always unsets MNT_WRITE_HOLD including the first mount we failed to change to read-only. Also sprinkle a few comments into the cleanup code to remind people about what is happening including myself. After all, I didn't catch it during review. This is only relevant on mainline and was reported by syzbot. Details about the syzbot reports are all in the commit message" * tag 'fs.fixes.v5.18-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux: fs: unset MNT_WRITE_HOLD on failure
2022-04-22Merge tag 'sound-5.18-rc4' of ↵Linus Torvalds35-177/+300
git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound Pull sound fixes from Takashi Iwai: "At this time, the majority of changes are for pending ASoC fixes while a few usual HD-audio and USB-audio quirks are found. Almost all patches are small device-specific fixes, and nothing worrisome stands out, so far" * tag 'sound-5.18-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound: (37 commits) ALSA: hda/realtek: Add quirk for Clevo NP70PNP ALSA: hda: intel-dsp-config: Add RaptorLake PCI IDs ALSA: hda/realtek: Enable mute/micmute LEDs and limit mic boost on EliteBook 845/865 G9 ALSA: usb-audio: Clear MIDI port active flag after draining ALSA: usb-audio: add mapping for MSI MAG X570S Torpedo MAX. ALSA: hda/i915: Fix one too many pci_dev_put() ALSA: hda/hdmi: add HDMI codec VID for Raptorlake-P ALSA: hda/hdmi: fix warning about PCM count when used with SOF sound/oss/dmasound: fix 'dmasound_setup' defined but not used firmware: cs_dsp: Fix overrun of unterminated control name string ASoC: codecs: Fix an error handling path in (rx|tx|va)_macro_probe() ASoC: Intel: sof_es8336: Add a quirk for Huawei Matebook D15 ASoC: Intel: sof_es8336: add a quirk for headset at mic1 port ASoC: Intel: sof_es8336: support a separate gpio to control headphone ASoC: Intel: sof_es8336: simplify speaker gpio naming ASoC: wm8731: Disable the regulator when probing fails ASoC: Intel: soc-acpi: correct device endpoints for max98373 ASoC: codecs: wcd934x: do not switch off SIDO Buck when codec is in use ASoC: SOF: topology: Fix memory leak in sof_control_load() ASoC: SOF: topology: cleanup dailinks on widget unload ...
2022-04-22XArray: Disallow sibling entries of nodesMatthew Wilcox (Oracle)1-0/+2
There is a race between xas_split() and xas_load() which can result in the wrong page being returned, and thus data corruption. Fortunately, it's hard to hit (syzbot took three months to find it) and often guarded with VM_BUG_ON(). The anatomy of this race is: thread A thread B order-9 page is stored at index 0x200 lookup of page at index 0x274 page split starts load of sibling entry at offset 9 stores nodes at offsets 8-15 load of entry at offset 8 The entry at offset 8 turns out to be a node, and so we descend into it, and load the page at index 0x234 instead of 0x274. This is hard to fix on the split side; we could replace the entire node that contains the order-9 page instead of replacing the eight entries. Fixing it on the lookup side is easier; just disallow sibling entries that point to nodes. This cannot ever be a useful thing as the descent would not know the correct offset to use within the new node. The test suite continues to pass, but I have not added a new test for this bug. Reported-by: syzbot+cf4cf13056f85dec2c40@syzkaller.appspotmail.com Tested-by: syzbot+cf4cf13056f85dec2c40@syzkaller.appspotmail.com Fixes: 6b24ca4a1a8d ("mm: Use multi-index entries in the page cache") Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
2022-04-22tools: Add kmem_cache_alloc_lru()Matthew Wilcox (Oracle)2-2/+9
Turn kmem_cache_alloc() into a wrapper around kmem_cache_alloc_lru(). Fixes: 9bbdc0f32409 ("xarray: use kmem_cache_alloc_lru to allocate xa_node") Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reported-by: Liam R. Howlett <Liam.Howlett@oracle.com> Reported-by: Li Wang <liwang@redhat.com>
2022-04-22Merge branch 'akpm' (patches from Andrew)Linus Torvalds18-87/+327
Merge misc fixes from Andrew Morton: "13 patches. Subsystems affected by this patch series: mm (memory-failure, memcg, userfaultfd, hugetlbfs, mremap, oom-kill, kasan, hmm), and kcov" * emailed patches from Andrew Morton <akpm@linux-foundation.org>: mm/mmu_notifier.c: fix race in mmu_interval_notifier_remove() kcov: don't generate a warning on vm_insert_page()'s failure MAINTAINERS: add Vincenzo Frascino to KASAN reviewers oom_kill.c: futex: delay the OOM reaper to allow time for proper futex cleanup selftest/vm: add skip support to mremap_test selftest/vm: support xfail in mremap_test selftest/vm: verify remap destination address in mremap_test selftest/vm: verify mmap addr in mremap_test mm, hugetlb: allow for "high" userspace addresses userfaultfd: mark uffd_wp regardless of VM_WRITE flag memcg: sync flush only if periodic flush is delayed mm/memory-failure.c: skip huge_zero_page in memory_failure() mm/hwpoison: fix race between hugetlb free/demotion and memory_failure_hugetlb()
2022-04-22mm/vmalloc: huge vmalloc backing pages should be split rather than compoundNicholas Piggin1-15/+21
Huge vmalloc higher-order backing pages were allocated with __GFP_COMP in order to allow the sub-pages to be refcounted by callers such as "remap_vmalloc_page [sic]" (remap_vmalloc_range). However a similar problem exists for other struct page fields callers use, for example fb_deferred_io_fault() takes a vmalloc'ed page and not only refcounts it but uses ->lru, ->mapping, ->index. This is not compatible with compound sub-pages, and can cause bad page state issues like BUG: Bad page state in process swapper/0 pfn:00743 page:(____ptrval____) refcount:0 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x743 flags: 0x7ffff000000000(node=0|zone=0|lastcpupid=0x7ffff) raw: 007ffff000000000 c00c00000001d0c8 c00c00000001d0c8 0000000000000000 raw: 0000000000000000 0000000000000000 00000000ffffffff 0000000000000000 page dumped because: corrupted mapping in tail page Modules linked in: CPU: 0 PID: 1 Comm: swapper/0 Not tainted 5.18.0-rc3-00082-gfc6fff4a7ce1-dirty #2810 Call Trace: dump_stack_lvl+0x74/0xa8 (unreliable) bad_page+0x12c/0x170 free_tail_pages_check+0xe8/0x190 free_pcp_prepare+0x31c/0x4e0 free_unref_page+0x40/0x1b0 __vunmap+0x1d8/0x420 ... The correct approach is to use split high-order pages for the huge vmalloc backing. These allow callers to treat them in exactly the same way as individually-allocated order-0 pages. Link: https://lore.kernel.org/all/14444103-d51b-0fb3-ee63-c3f182f0b546@molgen.mpg.de/ Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Cc: Paul Menzel <pmenzel@molgen.mpg.de> Cc: Song Liu <songliubraving@fb.com> Cc: Rick Edgecombe <rick.p.edgecombe@intel.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2022-04-22arm64: mm: fix p?d_leaf()Muchun Song1-2/+2
The pmd_leaf() is used to test a leaf mapped PMD, however, it misses the PROT_NONE mapped PMD on arm64. Fix it. A real world issue [1] caused by this was reported by Qian Cai. Also fix pud_leaf(). Link: https://patchwork.kernel.org/comment/24798260/ [1] Fixes: 8aa82df3c123 ("arm64: mm: add p?d_leaf() definitions") Reported-by: Qian Cai <quic_qiancai@quicinc.com> Signed-off-by: Muchun Song <songmuchun@bytedance.com> Link: https://lore.kernel.org/r/20220422060033.48711-1-songmuchun@bytedance.com Signed-off-by: Will Deacon <will@kernel.org>
2022-04-21Merge tag 'drm-fixes-2022-04-22' of git://anongit.freedesktop.org/drm/drmLinus Torvalds3-24/+33
Pull drm fixes from Dave Airlie: "Extra quiet after Easter, only have minor i915 and msm pulls. However I haven't seen a PR from our misc tree in a little while, I've cc'ed all the suspects. Once that unblocks I expect a bit larger bunch of patches to arrive. Otherwise as I said, one msm revert and two i915 fixes. msm: - revert iommu change that broke some platforms. i915: - Unset enable_psr2_sel_fetch if PSR2 detection fails - Fix to detect when VRR is turned off from panel settings" * tag 'drm-fixes-2022-04-22' of git://anongit.freedesktop.org/drm/drm: drm/i915/display/psr: Unset enable_psr2_sel_fetch if other checks in intel_psr2_config_valid() fails drm/msm: Revert "drm/msm: Stop using iommu_present()" drm/i915/display/vrr: Reset VRR capable property on a long hpd
2022-04-21mm/mmu_notifier.c: fix race in mmu_interval_notifier_remove()Alistair Popple1-1/+13
In some cases it is possible for mmu_interval_notifier_remove() to race with mn_tree_inv_end() allowing it to return while the notifier data structure is still in use. Consider the following sequence: CPU0 - mn_tree_inv_end() CPU1 - mmu_interval_notifier_remove() ----------------------------------- ------------------------------------ spin_lock(subscriptions->lock); seq = subscriptions->invalidate_seq; spin_lock(subscriptions->lock); spin_unlock(subscriptions->lock); subscriptions->invalidate_seq++; wait_event(invalidate_seq != seq); return; interval_tree_remove(interval_sub); kfree(interval_sub); spin_unlock(subscriptions->lock); wake_up_all(); As the wait_event() condition is true it will return immediately. This can lead to use-after-free type errors if the caller frees the data structure containing the interval notifier subscription while it is still on a deferred list. Fix this by taking the appropriate lock when reading invalidate_seq to ensure proper synchronisation. I observed this whilst running stress testing during some development. You do have to be pretty unlucky, but it leads to the usual problems of use-after-free (memory corruption, kernel crash, difficult to diagnose WARN_ON, etc). Link: https://lkml.kernel.org/r/20220420043734.476348-1-apopple@nvidia.com Fixes: 99cb252f5e68 ("mm/mmu_notifier: add an interval tree notifier") Signed-off-by: Alistair Popple <apopple@nvidia.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com> Cc: Christian König <christian.koenig@amd.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Ralph Campbell <rcampbell@nvidia.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2022-04-21kcov: don't generate a warning on vm_insert_page()'s failureAleksandr Nogikh1-2/+5
vm_insert_page()'s failure is not an unexpected condition, so don't do WARN_ONCE() in such a case. Instead, print a kernel message and just return an error code. This flaw has been reported under an OOM condition by sysbot [1]. The message is mainly for the benefit of the test log, in this case the fuzzer's log so that humans inspecting the log can figure out what was going on. KCOV is a testing tool, so I think being a little more chatty when KCOV unexpectedly is about to fail will save someone debugging time. We don't want the WARN, because it's not a kernel bug that syzbot should report, and failure can happen if the fuzzer tries hard enough (as above). Link: https://lkml.kernel.org/r/Ylkr2xrVbhQYwNLf@elver.google.com [1] Link: https://lkml.kernel.org/r/20220401182512.249282-1-nogikh@google.com Fixes: b3d7fe86fbd0 ("kcov: properly handle subsequent mmap calls"), Signed-off-by: Aleksandr Nogikh <nogikh@google.com> Acked-by: Marco Elver <elver@google.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Andrey Konovalov <andreyknvl@gmail.com> Cc: Alexander Potapenko <glider@google.com> Cc: Taras Madan <tarasmadan@google.com> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2022-04-21MAINTAINERS: add Vincenzo Frascino to KASAN reviewersVincenzo Frascino1-0/+1
Add my email address to KASAN reviewers list to make sure that I am Cc'ed in all the KASAN changes that may affect arm64 MTE. Link: https://lkml.kernel.org/r/20220419170640.21404-1-vincenzo.frascino@arm.com Signed-off-by: Vincenzo Frascino <vincenzo.frascino@arm.com> Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com> Cc: Andrey Konovalov <andreyknvl@gmail.com> Cc: Alexander Potapenko <glider@google.com> Cc: Dmitry Vyukov <dvyukov@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2022-04-21oom_kill.c: futex: delay the OOM reaper to allow time for proper futex cleanupNico Pache2-14/+41
The pthread struct is allocated on PRIVATE|ANONYMOUS memory [1] which can be targeted by the oom reaper. This mapping is used to store the futex robust list head; the kernel does not keep a copy of the robust list and instead references a userspace address to maintain the robustness during a process death. A race can occur between exit_mm and the oom reaper that allows the oom reaper to free the memory of the futex robust list before the exit path has handled the futex death: CPU1 CPU2 -------------------------------------------------------------------- page_fault do_exit "signal" wake_oom_reaper oom_reaper oom_reap_task_mm (invalidates mm) exit_mm exit_mm_release futex_exit_release futex_cleanup exit_robust_list get_user (EFAULT- can't access memory) If the get_user EFAULT's, the kernel will be unable to recover the waiters on the robust_list, leaving userspace mutexes hung indefinitely. Delay the OOM reaper, allowing more time for the exit path to perform the futex cleanup. Reproducer: https://gitlab.com/jsavitz/oom_futex_reproducer Based on a patch by Michal Hocko. Link: https://elixir.bootlin.com/glibc/glibc-2.35/source/nptl/allocatestack.c#L370 [1] Link: https://lkml.kernel.org/r/20220414144042.677008-1-npache@redhat.com Fixes: 212925802454 ("mm: oom: let oom_reap_task and exit_mmap run concurrently") Signed-off-by: Joel Savitz <jsavitz@redhat.com> Signed-off-by: Nico Pache <npache@redhat.com> Co-developed-by: Joel Savitz <jsavitz@redhat.com> Suggested-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Rafael Aquini <aquini@redhat.com> Cc: Waiman Long <longman@redhat.com> Cc: Herton R. Krzesinski <herton@redhat.com> Cc: Juri Lelli <juri.lelli@redhat.com> Cc: Vincent Guittot <vincent.guittot@linaro.org> Cc: Dietmar Eggemann <dietmar.eggemann@arm.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Ben Segall <bsegall@google.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Daniel Bristot de Oliveira <bristot@redhat.com> Cc: David Rientjes <rientjes@google.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Joel Savitz <jsavitz@redhat.com> Cc: Darren Hart <dvhart@infradead.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2022-04-21selftest/vm: add skip support to mremap_testSidhartha Kumar1-3/+8
Allow the mremap test to be skipped due to errors such as failing to parse the mmap_min_addr sysctl. Link: https://lkml.kernel.org/r/20220420215721.4868-4-sidhartha.kumar@oracle.com Signed-off-by: Sidhartha Kumar <sidhartha.kumar@oracle.com> Reviewed-by: Shuah Khan <skhan@linuxfoundation.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2022-04-21selftest/vm: support xfail in mremap_testSidhartha Kumar1-1/+1
Use ksft_test_result_xfail for the tests which are expected to fail. Link: https://lkml.kernel.org/r/20220420215721.4868-3-sidhartha.kumar@oracle.com Signed-off-by: Sidhartha Kumar <sidhartha.kumar@oracle.com> Reviewed-by: Shuah Khan <skhan@linuxfoundation.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2022-04-21selftest/vm: verify remap destination address in mremap_testSidhartha Kumar1-3/+39
Because mremap does not have a MAP_FIXED_NOREPLACE flag, it can destroy existing mappings. This causes a segfault when regions such as text are remapped and the permissions are changed. Verify the requested mremap destination address does not overlap any existing mappings by using mmap's MAP_FIXED_NOREPLACE flag. Keep incrementing the destination address until a valid mapping is found or fail the current test once the max address is reached. Link: https://lkml.kernel.org/r/20220420215721.4868-2-sidhartha.kumar@oracle.com Signed-off-by: Sidhartha Kumar <sidhartha.kumar@oracle.com> Reviewed-by: Shuah Khan <skhan@linuxfoundation.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2022-04-21selftest/vm: verify mmap addr in mremap_testSidhartha Kumar1-1/+40
Avoid calling mmap with requested addresses that are less than the system's mmap_min_addr. When run as root, mmap returns EACCES when trying to map addresses < mmap_min_addr. This is not one of the error codes for the condition to retry the mmap in the test. Rather than arbitrarily retrying on EACCES, don't attempt an mmap until addr > vm.mmap_min_addr. Add a munmap call after an alignment check as the mappings are retained after the retry and can reach the vm.max_map_count sysctl. Link: https://lkml.kernel.org/r/20220420215721.4868-1-sidhartha.kumar@oracle.com Signed-off-by: Sidhartha Kumar <sidhartha.kumar@oracle.com> Reviewed-by: Shuah Khan <skhan@linuxfoundation.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2022-04-21mm, hugetlb: allow for "high" userspace addressesChristophe Leroy3-12/+13
This is a fix for commit f6795053dac8 ("mm: mmap: Allow for "high" userspace addresses") for hugetlb. This patch adds support for "high" userspace addresses that are optionally supported on the system and have to be requested via a hint mechanism ("high" addr parameter to mmap). Architectures such as powerpc and x86 achieve this by making changes to their architectural versions of hugetlb_get_unmapped_area() function. However, arm64 uses the generic version of that function. So take into account arch_get_mmap_base() and arch_get_mmap_end() in hugetlb_get_unmapped_area(). To allow that, move those two macros out of mm/mmap.c into include/linux/sched/mm.h If these macros are not defined in architectural code then they default to (TASK_SIZE) and (base) so should not introduce any behavioural changes to architectures that do not define them. For the time being, only ARM64 is affected by this change. Catalin (ARM64) said "We should have fixed hugetlb_get_unmapped_area() as well when we added support for 52-bit VA. The reason for commit f6795053dac8 was to prevent normal mmap() from returning addresses above 48-bit by default as some user-space had hard assumptions about this. It's a slight ABI change if you do this for hugetlb_get_unmapped_area() but I doubt anyone would notice. It's more likely that the current behaviour would cause issues, so I'd rather have them consistent. Basically when arm64 gained support for 52-bit addresses we did not want user-space calling mmap() to suddenly get such high addresses, otherwise we could have inadvertently broken some programs (similar behaviour to x86 here). Hence we added commit f6795053dac8. But we missed hugetlbfs which could still get such high mmap() addresses. So in theory that's a potential regression that should have bee addressed at the same time as commit f6795053dac8 (and before arm64 enabled 52-bit addresses)" Link: https://lkml.kernel.org/r/ab847b6edb197bffdfe189e70fb4ac76bfe79e0d.1650033747.git.christophe.leroy@csgroup.eu Fixes: f6795053dac8 ("mm: mmap: Allow for "high" userspace addresses") Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu> Reviewed-by: Catalin Marinas <catalin.marinas@arm.com> Cc: Steve Capper <steve.capper@arm.com> Cc: Will Deacon <will.deacon@arm.com> Cc: <stable@vger.kernel.org> [5.0.x] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2022-04-21userfaultfd: mark uffd_wp regardless of VM_WRITE flagNadav Amit1-6/+9
When a PTE is set by UFFD operations such as UFFDIO_COPY, the PTE is currently only marked as write-protected if the VMA has VM_WRITE flag set. This seems incorrect or at least would be unexpected by the users. Consider the following sequence of operations that are being performed on a certain page: mprotect(PROT_READ) UFFDIO_COPY(UFFDIO_COPY_MODE_WP) mprotect(PROT_READ|PROT_WRITE) At this point the user would expect to still get UFFD notification when the page is accessed for write, but the user would not get one, since the PTE was not marked as UFFD_WP during UFFDIO_COPY. Fix it by always marking PTEs as UFFD_WP regardless on the write-permission in the VMA flags. Link: https://lkml.kernel.org/r/20220217211602.2769-1-namit@vmware.com Fixes: 292924b26024 ("userfaultfd: wp: apply _PAGE_UFFD_WP bit") Signed-off-by: Nadav Amit <namit@vmware.com> Acked-by: Peter Xu <peterx@redhat.com> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: Mike Rapoport <rppt@linux.vnet.ibm.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2022-04-21memcg: sync flush only if periodic flush is delayedShakeel Butt3-2/+17
Daniel Dao has reported [1] a regression on workloads that may trigger a lot of refaults (anon and file). The underlying issue is that flushing rstat is expensive. Although rstat flush are batched with (nr_cpus * MEMCG_BATCH) stat updates, it seems like there are workloads which genuinely do stat updates larger than batch value within short amount of time. Since the rstat flush can happen in the performance critical codepaths like page faults, such workload can suffer greatly. This patch fixes this regression by making the rstat flushing conditional in the performance critical codepaths. More specifically, the kernel relies on the async periodic rstat flusher to flush the stats and only if the periodic flusher is delayed by more than twice the amount of its normal time window then the kernel allows rstat flushing from the performance critical codepaths. Now the question: what are the side-effects of this change? The worst that can happen is the refault codepath will see 4sec old lruvec stats and may cause false (or missed) activations of the refaulted page which may under-or-overestimate the workingset size. Though that is not very concerning as the kernel can already miss or do false activations. There are two more codepaths whose flushing behavior is not changed by this patch and we may need to come to them in future. One is the writeback stats used by dirty throttling and second is the deactivation heuristic in the reclaim. For now keeping an eye on them and if there is report of regression due to these codepaths, we will reevaluate then. Link: https://lore.kernel.org/all/CA+wXwBSyO87ZX5PVwdHm-=dBjZYECGmfnydUicUyrQqndgX2MQ@mail.gmail.com [1] Link: https://lkml.kernel.org/r/20220304184040.1304781-1-shakeelb@google.com Fixes: 1f828223b799 ("memcg: flush lruvec stats in the refault") Signed-off-by: Shakeel Butt <shakeelb@google.com> Reported-by: Daniel Dao <dqminh@cloudflare.com> Tested-by: Ivan Babrou <ivan@cloudflare.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Koutný <mkoutny@suse.com> Cc: Frank Hofmann <fhofmann@cloudflare.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2022-04-21mm/memory-failure.c: skip huge_zero_page in memory_failure()Xu Yu1-0/+13
Kernel panic when injecting memory_failure for the global huge_zero_page, when CONFIG_DEBUG_VM is enabled, as follows. Injecting memory failure for pfn 0x109ff9 at process virtual address 0x20ff9000 page:00000000fb053fc3 refcount:2 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x109e00 head:00000000fb053fc3 order:9 compound_mapcount:0 compound_pincount:0 flags: 0x17fffc000010001(locked|head|node=0|zone=2|lastcpupid=0x1ffff) raw: 017fffc000010001 0000000000000000 dead000000000122 0000000000000000 raw: 0000000000000000 0000000000000000 00000002ffffffff 0000000000000000 page dumped because: VM_BUG_ON_PAGE(is_huge_zero_page(head)) ------------[ cut here ]------------ kernel BUG at mm/huge_memory.c:2499! invalid opcode: 0000 [#1] PREEMPT SMP PTI CPU: 6 PID: 553 Comm: split_bug Not tainted 5.18.0-rc1+ #11 Hardware name: Alibaba Cloud Alibaba Cloud ECS, BIOS 3288b3c 04/01/2014 RIP: 0010:split_huge_page_to_list+0x66a/0x880 Code: 84 9b fb ff ff 48 8b 7c 24 08 31 f6 e8 9f 5d 2a 00 b8 b8 02 00 00 e9 e8 fb ff ff 48 c7 c6 e8 47 3c 82 4c b RSP: 0018:ffffc90000dcbdf8 EFLAGS: 00010246 RAX: 000000000000003c RBX: 0000000000000001 RCX: 0000000000000000 RDX: 0000000000000000 RSI: ffffffff823e4c4f RDI: 00000000ffffffff RBP: ffff88843fffdb40 R08: 0000000000000000 R09: 00000000fffeffff R10: ffffc90000dcbc48 R11: ffffffff82d68448 R12: ffffea0004278000 R13: ffffffff823c6203 R14: 0000000000109ff9 R15: ffffea000427fe40 FS: 00007fc375a26740(0000) GS:ffff88842fd80000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007fc3757c9290 CR3: 0000000102174006 CR4: 00000000003706e0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: try_to_split_thp_page+0x3a/0x130 memory_failure+0x128/0x800 madvise_inject_error.cold+0x8b/0xa1 __x64_sys_madvise+0x54/0x60 do_syscall_64+0x35/0x80 entry_SYSCALL_64_after_hwframe+0x44/0xae RIP: 0033:0x7fc3754f8bf9 Code: 01 00 48 81 c4 80 00 00 00 e9 f1 fe ff ff 0f 1f 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 8 RSP: 002b:00007ffeda93a1d8 EFLAGS: 00000217 ORIG_RAX: 000000000000001c RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007fc3754f8bf9 RDX: 0000000000000064 RSI: 0000000000003000 RDI: 0000000020ff9000 RBP: 00007ffeda93a200 R08: 0000000000000000 R09: 0000000000000000 R10: 00000000ffffffff R11: 0000000000000217 R12: 0000000000400490 R13: 00007ffeda93a2e0 R14: 0000000000000000 R15: 0000000000000000 This makes huge_zero_page bail out explicitly before split in memory_failure(), thus the panic above won't happen again. Link: https://lkml.kernel.org/r/497d3835612610e370c74e697ea3c721d1d55b9c.1649775850.git.xuyu@linux.alibaba.com Fixes: 6a46079cf57a ("HWPOISON: The high level memory error handler in the VM v7") Signed-off-by: Xu Yu <xuyu@linux.alibaba.com> Reported-by: Abaci <abaci@linux.alibaba.com> Suggested-by: Naoya Horiguchi <naoya.horiguchi@nec.com> Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com> Reviewed-by: Miaohe Lin <linmiaohe@huawei.com> Cc: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2022-04-21mm/hwpoison: fix race between hugetlb free/demotion and memory_failure_hugetlb()Naoya Horiguchi4-42/+127
There is a race condition between memory_failure_hugetlb() and hugetlb free/demotion, which causes setting PageHWPoison flag on the wrong page. The one simple result is that wrong processes can be killed, but another (more serious) one is that the actual error is left unhandled, so no one prevents later access to it, and that might lead to more serious results like consuming corrupted data. Think about the below race window: CPU 1 CPU 2 memory_failure_hugetlb struct page *head = compound_head(p); hugetlb page might be freed to buddy, or even changed to another compound page. get_hwpoison_page -- page is not what we want now... The current code first does prechecks roughly and then reconfirms after taking refcount, but it's found that it makes code overly complicated, so move the prechecks in a single hugetlb_lock range. A newly introduced function, try_memory_failure_hugetlb(), always takes hugetlb_lock (even for non-hugetlb pages). That can be improved, but memory_failure() is rare in principle, so should not be a big problem. Link: https://lkml.kernel.org/r/20220408135323.1559401-2-naoya.horiguchi@linux.dev Fixes: 761ad8d7c7b5 ("mm: hwpoison: introduce memory_failure_hugetlb()") Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com> Reported-by: Mike Kravetz <mike.kravetz@oracle.com> Reviewed-by: Miaohe Lin <linmiaohe@huawei.com> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> Cc: Yang Shi <shy828301@gmail.com> Cc: Dan Carpenter <dan.carpenter@oracle.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2022-04-22ata: pata_marvell: Check the 'bmdma_addr' beforing readingZheyu Ma1-0/+2
Before detecting the cable type on the dma bar, the driver should check whether the 'bmdma_addr' is zero, which means the adapter does not support DMA, otherwise we will get the following error: [ 5.146634] Bad IO access at port 0x1 (return inb(port)) [ 5.147206] WARNING: CPU: 2 PID: 303 at lib/iomap.c:44 ioread8+0x4a/0x60 [ 5.150856] RIP: 0010:ioread8+0x4a/0x60 [ 5.160238] Call Trace: [ 5.160470] <TASK> [ 5.160674] marvell_cable_detect+0x6e/0xc0 [pata_marvell] [ 5.161728] ata_eh_recover+0x3520/0x6cc0 [ 5.168075] ata_do_eh+0x49/0x3c0 Signed-off-by: Zheyu Ma <zheyuma97@gmail.com> Signed-off-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
2022-04-22Merge tag 'drm-msm-fixes-2022-04-20' of ↵Dave Airlie1-1/+1
https://gitlab.freedesktop.org/drm/msm into drm-fixes Revert to fix iommu regression. Signed-off-by: Dave Airlie <airlied@redhat.com> From: Rob Clark <robdclark@gmail.com> Link: https://patchwork.freedesktop.org/patch/msgid/CAF6AEGtvPo4xD2peAztDMPP2n4utb7d9WQboMFwsba9E8U2rCw@mail.gmail.com
2022-04-21Merge tag 'dmaengine-fix-5.18' of ↵Linus Torvalds8-34/+53
git://git.kernel.org/pub/scm/linux/kernel/git/vkoul/dmaengine Pull dmaengine fixes from Vinod Koul: "A bunch of driver fixes: - idxd device RO checks and device cleanup - dw-edma unaligned access and alignment - qcom: missing minItems in binding - mediatek pm usage fix - imx init script" * tag 'dmaengine-fix-5.18' of git://git.kernel.org/pub/scm/linux/kernel/git/vkoul/dmaengine: dt-bindings: dmaengine: qcom: gpi: Add minItems for interrupts dmaengine: idxd: skip clearing device context when device is read-only dmaengine: idxd: add RO check for wq max_transfer_size write dmaengine: idxd: add RO check for wq max_batch_size write dmaengine: idxd: fix retry value to be constant for duration of function call dmaengine: idxd: match type for retries var in idxd_enqcmds() dmaengine: dw-edma: Fix inconsistent indenting dmaengine: dw-edma: Fix unaligned 64bit access dmaengine: mediatek:Fix PM usage reference leak of mtk_uart_apdma_alloc_chan_resources dmaengine: imx-sdma: Fix error checking in sdma_event_remap dma: at_xdmac: fix a missing check on list iterator dmaengine: imx-sdma: fix init of uart scripts dmaengine: idxd: fix device cleanup on disable
2022-04-21RISC-V: cpuidle: fix Kconfig select for RISCV_SBI_CPUIDLERandy Dunlap1-1/+1
There can be lots of build errors when building cpuidle-riscv-sbi.o. They are all caused by a kconfig problem with this warning: WARNING: unmet direct dependencies detected for RISCV_SBI_CPUIDLE Depends on [n]: CPU_IDLE [=y] && RISCV [=y] && RISCV_SBI [=n] Selected by [y]: - SOC_VIRT [=y] && CPU_IDLE [=y] so make the 'select' of RISCV_SBI_CPUIDLE also depend on RISCV_SBI. Fixes: c5179ef1ca0c ("RISC-V: Enable RISC-V SBI CPU Idle driver for QEMU virt machine") Signed-off-by: Randy Dunlap <rdunlap@infradead.org> Reported-by: kernel test robot <lkp@intel.com> Reviewed-by: Anup Patel <anup@brainfault.org> Cc: stable@vger.kernel.org Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
2022-04-21RISC-V: mm: Fix set_satp_mode() for platform not having Sv57Anup Patel1-0/+1
When Sv57 is not available the satp.MODE test in set_satp_mode() will fail and lead to pgdir re-programming for Sv48. The pgdir re-programming will fail as well due to pre-existing pgdir entry used for Sv57 and as a result kernel fails to boot on RISC-V platform not having Sv57. To fix above issue, we should clear the pgdir memory in set_satp_mode() before re-programming. Fixes: 011f09d12052 ("riscv: mm: Set sv57 on defaultly") Reported-by: Mayuresh Chitale <mchitale@ventanamicro.com> Signed-off-by: Anup Patel <apatel@ventanamicro.com> Reviewed-by: Atish Patra <atishp@rivosinc.com> Cc: stable@vger.kernel.org Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
2022-04-22Merge tag 'drm-intel-fixes-2022-04-20' of ↵Dave Airlie2-23/+32
git://anongit.freedesktop.org/drm/drm-intel into drm-fixes - Unset enable_psr2_sel_fetch if PSR2 detection fails - Fix to detect when VRR is turned off from panel settings Signed-off-by: Dave Airlie <airlied@redhat.com> From: Joonas Lahtinen <joonas.lahtinen@linux.intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/YmAKuHwon7hGyIoC@jlahtine-mobl.ger.corp.intel.com
2022-04-21kvm: selftests: introduce and use more page size-related constantsPaolo Bonzini8-13/+8
Clean up code that was hardcoding masks for various fields, now that the masks are included in processor.h. For more cleanup, define PAGE_SIZE and PAGE_MASK just like in Linux. PAGE_SIZE in particular was defined by several tests. Suggested-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Peter Xu <peterx@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-04-21kvm: selftests: do not use bitfields larger than 32-bits for PTEsPaolo Bonzini2-115/+92
Red Hat's QE team reported test failure on access_tracking_perf_test: Testing guest mode: PA-bits:ANY, VA-bits:48, 4K pages guest physical test memory offset: 0x3fffbffff000 Populating memory : 0.684014577s Writing to populated memory : 0.006230175s Reading from populated memory : 0.004557805s ==== Test Assertion Failure ==== lib/kvm_util.c:1411: false pid=125806 tid=125809 errno=4 - Interrupted system call 1 0x0000000000402f7c: addr_gpa2hva at kvm_util.c:1411 2 (inlined by) addr_gpa2hva at kvm_util.c:1405 3 0x0000000000401f52: lookup_pfn at access_tracking_perf_test.c:98 4 (inlined by) mark_vcpu_memory_idle at access_tracking_perf_test.c:152 5 (inlined by) vcpu_thread_main at access_tracking_perf_test.c:232 6 0x00007fefe9ff81ce: ?? ??:0 7 0x00007fefe9c64d82: ?? ??:0 No vm physical memory at 0xffbffff000 I can easily reproduce it with a Intel(R) Xeon(R) CPU E5-2630 with 46 bits PA. It turns out that the address translation for clearing idle page tracking returned a wrong result; addr_gva2gpa()'s last step, which is based on "pte[index[0]].pfn", did the calculation with 40 bits length and the high 12 bits got truncated. In above case the GPA address to be returned should be 0x3fffbffff000 for GVA 0xc0000000, but it got truncated into 0xffbffff000 and the subsequent gpa2hva lookup failed. The width of operations on bit fields greater than 32-bit is implementation defined, and differs between GCC (which uses the bitfield precision) and clang (which uses 64-bit arithmetic), so this is a potential minefield. Remove the bit fields and using manual masking instead. Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2075036 Reported-by: Nana Liu <nanliu@redhat.com> Reviewed-by: Peter Xu <peterx@redhat.com> Tested-by: Peter Xu <peterx@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-04-21KVM: SEV: add cache flush to solve SEV cache incoherency issuesMingwei Zhang8-3/+44
Flush the CPU caches when memory is reclaimed from an SEV guest (where reclaim also includes it being unmapped from KVM's memslots). Due to lack of coherency for SEV encrypted memory, failure to flush results in silent data corruption if userspace is malicious/broken and doesn't ensure SEV guest memory is properly pinned and unpinned. Cache coherency is not enforced across the VM boundary in SEV (AMD APM vol.2 Section 15.34.7). Confidential cachelines, generated by confidential VM guests have to be explicitly flushed on the host side. If a memory page containing dirty confidential cachelines was released by VM and reallocated to another user, the cachelines may corrupt the new user at a later time. KVM takes a shortcut by assuming all confidential memory remain pinned until the end of VM lifetime. Therefore, KVM does not flush cache at mmu_notifier invalidation events. Because of this incorrect assumption and the lack of cache flushing, malicous userspace can crash the host kernel: creating a malicious VM and continuously allocates/releases unpinned confidential memory pages when the VM is running. Add cache flush operations to mmu_notifier operations to ensure that any physical memory leaving the guest VM get flushed. In particular, hook mmu_notifier_invalidate_range_start and mmu_notifier_release events and flush cache accordingly. The hook after releasing the mmu lock to avoid contention with other vCPUs. Cc: stable@vger.kernel.org Suggested-by: Sean Christpherson <seanjc@google.com> Reported-by: Mingwei Zhang <mizhang@google.com> Signed-off-by: Mingwei Zhang <mizhang@google.com> Message-Id: <20220421031407.2516575-4-mizhang@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-04-21Merge tag 'net-5.18-rc4' of ↵Linus Torvalds40-83/+210
git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net Pull networking fixes from Paolo Abeni: "Including fixes from xfrm and can. Current release - regressions: - rxrpc: restore removed timer deletion Current release - new code bugs: - gre: fix device lookup for l3mdev use-case - xfrm: fix egress device lookup for l3mdev use-case Previous releases - regressions: - sched: cls_u32: fix netns refcount changes in u32_change() - smc: fix sock leak when release after smc_shutdown() - xfrm: limit skb_page_frag_refill use to a single page - eth: atlantic: invert deep par in pm functions, preventing null derefs - eth: stmmac: use readl_poll_timeout_atomic() in atomic state Previous releases - always broken: - gre: fix skb_under_panic on xmit - openvswitch: fix OOB access in reserve_sfa_size() - dsa: hellcreek: calculate checksums in tagger - eth: ice: fix crash in switchdev mode - eth: igc: - fix infinite loop in release_swfw_sync - fix scheduling while atomic" * tag 'net-5.18-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (37 commits) drivers: net: hippi: Fix deadlock in rr_close() selftests: mlxsw: vxlan_flooding_ipv6: Prevent flooding of unwanted packets selftests: mlxsw: vxlan_flooding: Prevent flooding of unwanted packets nfc: MAINTAINERS: add Bug entry net: stmmac: Use readl_poll_timeout_atomic() in atomic state doc/ip-sysctl: add bc_forwarding netlink: reset network and mac headers in netlink_dump() net: mscc: ocelot: fix broken IP multicast flooding net: dsa: hellcreek: Calculate checksums in tagger net: atlantic: invert deep par in pm functions, preventing null derefs can: isotp: stop timeout monitoring when no first frame was sent bonding: do not discard lowest hash bit for non layer3+4 hashing net: lan966x: Make sure to release ptp interrupt ipv6: make ip6_rt_gc_expire an atomic_t net: Handle l3mdev in ip_tunnel_init_flow l3mdev: l3mdev_master_upper_ifindex_by_index_rcu should be using netdev_master_upper_dev_get_rcu net/sched: cls_u32: fix possible leak in u32_init_knode() net/sched: cls_u32: fix netns refcount changes in u32_change() powerpc: Update MAINTAINERS for ibmvnic and VAS net: restore alpha order to Ethernet devices in config ...
2022-04-21ALSA: hda/realtek: Add quirk for Clevo NP70PNPTim Crawford1-0/+1
Fixes headset detection on Clevo NP70PNP. Signed-off-by: Tim Crawford <tcrawford@system76.com> Cc: <stable@vger.kernel.org> Link: https://lore.kernel.org/r/20220421170412.3697-1-tcrawford@system76.com Signed-off-by: Takashi Iwai <tiwai@suse.de>
2022-04-21ALSA: hda: intel-dsp-config: Add RaptorLake PCI IDsGongjun Song1-0/+9
Add RaptorLake-P PCI IDs Reviewed-by: Kai Vehmanen <kai.vehmanen@linux.intel.com> Signed-off-by: Gongjun Song <gongjun.song@intel.com> Signed-off-by: Pierre-Louis Bossart <pierre-louis.bossart@linux.intel.com> Link: https://lore.kernel.org/r/20220421163546.319604-1-pierre-louis.bossart@linux.intel.com Signed-off-by: Takashi Iwai <tiwai@suse.de>
2022-04-21jbd2: fix a potential race while discarding reserved buffers after an abortYe Bin1-1/+3
we got issue as follows: [ 72.796117] EXT4-fs error (device sda): ext4_journal_check_start:83: comm fallocate: Detected aborted journal [ 72.826847] EXT4-fs (sda): Remounting filesystem read-only fallocate: fallocate failed: Read-only file system [ 74.791830] jbd2_journal_commit_transaction: jh=0xffff9cfefe725d90 bh=0x0000000000000000 end delay [ 74.793597] ------------[ cut here ]------------ [ 74.794203] kernel BUG at fs/jbd2/transaction.c:2063! [ 74.794886] invalid opcode: 0000 [#1] PREEMPT SMP PTI [ 74.795533] CPU: 4 PID: 2260 Comm: jbd2/sda-8 Not tainted 5.17.0-rc8-next-20220315-dirty #150 [ 74.798327] RIP: 0010:__jbd2_journal_unfile_buffer+0x3e/0x60 [ 74.801971] RSP: 0018:ffffa828c24a3cb8 EFLAGS: 00010202 [ 74.802694] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000 [ 74.803601] RDX: 0000000000000001 RSI: ffff9cfefe725d90 RDI: ffff9cfefe725d90 [ 74.804554] RBP: ffff9cfefe725d90 R08: 0000000000000000 R09: ffffa828c24a3b20 [ 74.805471] R10: 0000000000000001 R11: 0000000000000001 R12: ffff9cfefe725d90 [ 74.806385] R13: ffff9cfefe725d98 R14: 0000000000000000 R15: ffff9cfe833a4d00 [ 74.807301] FS: 0000000000000000(0000) GS:ffff9d01afb00000(0000) knlGS:0000000000000000 [ 74.808338] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 74.809084] CR2: 00007f2b81bf4000 CR3: 0000000100056000 CR4: 00000000000006e0 [ 74.810047] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 74.810981] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [ 74.811897] Call Trace: [ 74.812241] <TASK> [ 74.812566] __jbd2_journal_refile_buffer+0x12f/0x180 [ 74.813246] jbd2_journal_refile_buffer+0x4c/0xa0 [ 74.813869] jbd2_journal_commit_transaction.cold+0xa1/0x148 [ 74.817550] kjournald2+0xf8/0x3e0 [ 74.819056] kthread+0x153/0x1c0 [ 74.819963] ret_from_fork+0x22/0x30 Above issue may happen as follows: write truncate kjournald2 generic_perform_write ext4_write_begin ext4_walk_page_buffers do_journal_get_write_access ->add BJ_Reserved list ext4_journalled_write_end ext4_walk_page_buffers write_end_fn ext4_handle_dirty_metadata ***************JBD2 ABORT************** jbd2_journal_dirty_metadata -> return -EROFS, jh in reserved_list jbd2_journal_commit_transaction while (commit_transaction->t_reserved_list) jh = commit_transaction->t_reserved_list; truncate_pagecache_range do_invalidatepage ext4_journalled_invalidatepage jbd2_journal_invalidatepage journal_unmap_buffer __dispose_buffer __jbd2_journal_unfile_buffer jbd2_journal_put_journal_head ->put last ref_count __journal_remove_journal_head bh->b_private = NULL; jh->b_bh = NULL; jbd2_journal_refile_buffer(journal, jh); bh = jh2bh(jh); ->bh is NULL, later will trigger null-ptr-deref journal_free_journal_head(jh); After commit 96f1e0974575, we no longer hold the j_state_lock while iterating over the list of reserved handles in jbd2_journal_commit_transaction(). This potentially allows the journal_head to be freed by journal_unmap_buffer while the commit codepath is also trying to free the BJ_Reserved buffers. Keeping j_state_lock held while trying extends hold time of the lock minimally, and solves this issue. Fixes: 96f1e0974575("jbd2: avoid long hold times of j_state_lock while committing a transaction") Signed-off-by: Ye Bin <yebin10@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20220317142137.1821590-1-yebin10@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2022-04-21KVM: SVM: Flush when freeing encrypted pages even on SME_COHERENT CPUsMingwei Zhang1-3/+6
Use clflush_cache_range() to flush the confidential memory when SME_COHERENT is supported in AMD CPU. Cache flush is still needed since SME_COHERENT only support cache invalidation at CPU side. All confidential cache lines are still incoherent with DMA devices. Cc: stable@vger.kerel.org Fixes: add5e2f04541 ("KVM: SVM: Add support for the SEV-ES VMSA") Reviewed-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Mingwei Zhang <mizhang@google.com> Message-Id: <20220421031407.2516575-3-mizhang@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-04-21KVM: SVM: Simplify and harden helper to flush SEV guest page(s)Sean Christopherson1-34/+20
Rework sev_flush_guest_memory() to explicitly handle only a single page, and harden it to fall back to WBINVD if VM_PAGE_FLUSH fails. Per-page flushing is currently used only to flush the VMSA, and in its current form, the helper is completely broken with respect to flushing actual guest memory, i.e. won't work correctly for an arbitrary memory range. VM_PAGE_FLUSH takes a host virtual address, and is subject to normal page walks, i.e. will fault if the address is not present in the host page tables or does not have the correct permissions. Current AMD CPUs also do not honor SMAP overrides (undocumented in kernel versions of the APM), so passing in a userspace address is completely out of the question. In other words, KVM would need to manually walk the host page tables to get the pfn, ensure the pfn is stable, and then use the direct map to invoke VM_PAGE_FLUSH. And the latter might not even work, e.g. if userspace is particularly evil/clever and backs the guest with Secret Memory (which unmaps memory from the direct map). Signed-off-by: Sean Christopherson <seanjc@google.com> Fixes: add5e2f04541 ("KVM: SVM: Add support for the SEV-ES VMSA") Reported-by: Mingwei Zhang <mizhang@google.com> Cc: stable@vger.kernel.org Signed-off-by: Mingwei Zhang <mizhang@google.com> Message-Id: <20220421031407.2516575-2-mizhang@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-04-21KVM: selftests: Silence compiler warning in the kvm_page_table_testThomas Huth1-1/+1
When compiling kvm_page_table_test.c, I get this compiler warning with gcc 11.2: kvm_page_table_test.c: In function 'pre_init_before_test': ../../../../tools/include/linux/kernel.h:44:24: warning: comparison of distinct pointer types lacks a cast 44 | (void) (&_max1 == &_max2); \ | ^~ kvm_page_table_test.c:281:21: note: in expansion of macro 'max' 281 | alignment = max(0x100000, alignment); | ^~~ Fix it by adjusting the type of the absolute value. Signed-off-by: Thomas Huth <thuth@redhat.com> Reviewed-by: Claudio Imbrenda <imbrenda@linux.ibm.com> Message-Id: <20220414103031.565037-1-thuth@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-04-21KVM: x86/pmu: Update AMD PMC sample period to fix guest NMI-watchdogLike Xu3-6/+12
NMI-watchdog is one of the favorite features of kernel developers, but it does not work in AMD guest even with vPMU enabled and worse, the system misrepresents this capability via /proc. This is a PMC emulation error. KVM does not pass the latest valid value to perf_event in time when guest NMI-watchdog is running, thus the perf_event corresponding to the watchdog counter will enter the old state at some point after the first guest NMI injection, forcing the hardware register PMC0 to be constantly written to 0x800000000001. Meanwhile, the running counter should accurately reflect its new value based on the latest coordinated pmc->counter (from vPMC's point of view) rather than the value written directly by the guest. Fixes: 168d918f2643 ("KVM: x86: Adjust counter sample period after a wrmsr") Reported-by: Dongli Cao <caodongli@kingsoft.com> Signed-off-by: Like Xu <likexu@tencent.com> Reviewed-by: Yanan Wang <wangyanan55@huawei.com> Tested-by: Yanan Wang <wangyanan55@huawei.com> Reviewed-by: Jim Mattson <jmattson@google.com> Message-Id: <20220409015226.38619-1-likexu@tencent.com> Cc: stable@vger.kernel.org Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-04-21x86/kvm: Preserve BSP MSR_KVM_POLL_CONTROL across suspend/resumeWanpeng Li1-0/+13
MSR_KVM_POLL_CONTROL is cleared on reset, thus reverting guests to host-side polling after suspend/resume. Non-bootstrap CPUs are restored correctly by the haltpoll driver because they are hot-unplugged during suspend and hot-plugged during resume; however, the BSP is not hotpluggable and remains in host-sde polling mode after the guest resume. The makes the guest pay for the cost of vmexits every time the guest enters idle. Fix it by recording BSP's haltpoll state and resuming it during guest resume. Cc: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Wanpeng Li <wanpengli@tencent.com> Message-Id: <1650267752-46796-1-git-send-email-wanpengli@tencent.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-04-21KVM: SPDX style and spelling fixesTom Rix3-4/+4
SPDX comments use use /* */ style comments in headers anad // style comments in .c files. Also fix two spelling mistakes. Signed-off-by: Tom Rix <trix@redhat.com> Message-Id: <20220410153840.55506-1-trix@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-04-21KVM: x86: Skip KVM_GUESTDBG_BLOCKIRQ APICv update if APICv is disabledSean Christopherson1-0/+3
Skip the APICv inhibit update for KVM_GUESTDBG_BLOCKIRQ if APICv is disabled at the module level to avoid having to acquire the mutex and potentially process all vCPUs. The DISABLE inhibit will (barring bugs) never be lifted, so piling on more inhibits is unnecessary. Fixes: cae72dcc3b21 ("KVM: x86: inhibit APICv when KVM_GUESTDBG_BLOCKIRQ active") Cc: Maxim Levitsky <mlevitsk@redhat.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com> Message-Id: <20220420013732.3308816-5-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-04-21KVM: x86: Pend KVM_REQ_APICV_UPDATE during vCPU creation to fix a raceSean Christopherson1-1/+14
Make a KVM_REQ_APICV_UPDATE request when creating a vCPU with an in-kernel local APIC and APICv enabled at the module level. Consuming kvm_apicv_activated() and stuffing vcpu->arch.apicv_active directly can race with __kvm_set_or_clear_apicv_inhibit(), as vCPU creation happens before the vCPU is fully onlined, i.e. it won't get the request made to "all" vCPUs. If APICv is globally inhibited between setting apicv_active and onlining the vCPU, the vCPU will end up running with APICv enabled and trigger KVM's sanity check. Mark APICv as active during vCPU creation if APICv is enabled at the module level, both to be optimistic about it's final state, e.g. to avoid additional VMWRITEs on VMX, and because there are likely bugs lurking since KVM checks apicv_active in multiple vCPU creation paths. While keeping the current behavior of consuming kvm_apicv_activated() is arguably safer from a regression perspective, force apicv_active so that vCPU creation runs with deterministic state and so that if there are bugs, they are found sooner than later, i.e. not when some crazy race condition is hit. WARNING: CPU: 0 PID: 484 at arch/x86/kvm/x86.c:9877 vcpu_enter_guest+0x2ae3/0x3ee0 arch/x86/kvm/x86.c:9877 Modules linked in: CPU: 0 PID: 484 Comm: syz-executor361 Not tainted 5.16.13 #2 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1ubuntu1~cloud0 04/01/2014 RIP: 0010:vcpu_enter_guest+0x2ae3/0x3ee0 arch/x86/kvm/x86.c:9877 Call Trace: <TASK> vcpu_run arch/x86/kvm/x86.c:10039 [inline] kvm_arch_vcpu_ioctl_run+0x337/0x15e0 arch/x86/kvm/x86.c:10234 kvm_vcpu_ioctl+0x4d2/0xc80 arch/x86/kvm/../../../virt/kvm/kvm_main.c:3727 vfs_ioctl fs/ioctl.c:51 [inline] __do_sys_ioctl fs/ioctl.c:874 [inline] __se_sys_ioctl fs/ioctl.c:860 [inline] __x64_sys_ioctl+0x16d/0x1d0 fs/ioctl.c:860 do_syscall_x64 arch/x86/entry/common.c:50 [inline] do_syscall_64+0x38/0x90 arch/x86/entry/common.c:80 entry_SYSCALL_64_after_hwframe+0x44/0xae The bug was hit by a syzkaller spamming VM creation with 2 vCPUs and a call to KVM_SET_GUEST_DEBUG. r0 = openat$kvm(0xffffffffffffff9c, &(0x7f0000000000), 0x0, 0x0) r1 = ioctl$KVM_CREATE_VM(r0, 0xae01, 0x0) ioctl$KVM_CAP_SPLIT_IRQCHIP(r1, 0x4068aea3, &(0x7f0000000000)) (async) r2 = ioctl$KVM_CREATE_VCPU(r1, 0xae41, 0x0) (async) r3 = ioctl$KVM_CREATE_VCPU(r1, 0xae41, 0x400000000000002) ioctl$KVM_SET_GUEST_DEBUG(r3, 0x4048ae9b, &(0x7f00000000c0)={0x5dda9c14aa95f5c5}) ioctl$KVM_RUN(r2, 0xae80, 0x0) Reported-by: Gaoning Pan <pgn@zju.edu.cn> Reported-by: Yongkang Jia <kangel@zju.edu.cn> Fixes: 8df14af42f00 ("kvm: x86: Add support for dynamic APICv activation") Cc: stable@vger.kernel.org Cc: Maxim Levitsky <mlevitsk@redhat.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com> Message-Id: <20220420013732.3308816-4-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-04-21KVM: nVMX: Defer APICv updates while L2 is active until L1 is activeSean Christopherson3-0/+11
Defer APICv updates that occur while L2 is active until nested VM-Exit, i.e. until L1 regains control. vmx_refresh_apicv_exec_ctrl() assumes L1 is active and (a) stomps all over vmcs02 and (b) neglects to ever updated vmcs01. E.g. if vmcs12 doesn't enable the TPR shadow for L2 (and thus no APICv controls), L1 performs nested VM-Enter APICv inhibited, and APICv becomes unhibited while L2 is active, KVM will set various APICv controls in vmcs02 and trigger a failed VM-Entry. The kicker is that, unless running with nested_early_check=1, KVM blames L1 and chaos ensues. In all cases, ignoring vmcs02 and always deferring the inhibition change to vmcs01 is correct (or at least acceptable). The ABSENT and DISABLE inhibitions cannot truly change while L2 is active (see below). IRQ_BLOCKING can change, but it is firmly a best effort debug feature. Furthermore, only L2's APIC is accelerated/virtualized to the full extent possible, e.g. even if L1 passes through its APIC to L2, normal MMIO/MSR interception will apply to the virtual APIC managed by KVM. The exception is the SELF_IPI register when x2APIC is enabled, but that's an acceptable hole. Lastly, Hyper-V's Auto EOI can technically be toggled if L1 exposes the MSRs to L2, but for that to work in any sane capacity, L1 would need to pass through IRQs to L2 as well, and IRQs must be intercepted to enable virtual interrupt delivery. I.e. exposing Auto EOI to L2 and enabling VID for L2 are, for all intents and purposes, mutually exclusive. Lack of dynamic toggling is also why this scenario is all but impossible to encounter in KVM's current form. But a future patch will pend an APICv update request _during_ vCPU creation to plug a race where a vCPU that's being created doesn't get included in the "all vCPUs request" because it's not yet visible to other vCPUs. If userspaces restores L2 after VM creation (hello, KVM selftests), the first KVM_RUN will occur while L2 is active and thus service the APICv update request made during VM creation. Cc: stable@vger.kernel.org Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20220420013732.3308816-3-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-04-21KVM: x86: Tag APICv DISABLE inhibit, not ABSENT, if APICv is disabledSean Christopherson1-1/+1
Set the DISABLE inhibit, not the ABSENT inhibit, if APICv is disabled via module param. A recent refactoring to add a wrapper for setting/clearing inhibits unintentionally changed the flag, probably due to a copy+paste goof. Fixes: 4f4c4a3ee53c ("KVM: x86: Trace all APICv inhibit changes and capture overall status") Signed-off-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com> Message-Id: <20220420013732.3308816-2-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-04-21KVM: Initialize debugfs_dentry when a VM is created to avoid NULL derefSean Christopherson1-6/+6
Initialize debugfs_entry to its semi-magical -ENOENT value when the VM is created. KVM's teardown when VM creation fails is kludgy and calls kvm_uevent_notify_change() and kvm_destroy_vm_debugfs() even if KVM never attempted kvm_create_vm_debugfs(). Because debugfs_entry is zero initialized, the IS_ERR() checks pass and KVM derefs a NULL pointer. BUG: kernel NULL pointer dereference, address: 0000000000000018 #PF: supervisor read access in kernel mode #PF: error_code(0x0000) - not-present page PGD 1068b1067 P4D 1068b1067 PUD 1068b0067 PMD 0 Oops: 0000 [#1] SMP CPU: 0 PID: 871 Comm: repro Not tainted 5.18.0-rc1+ #825 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015 RIP: 0010:__dentry_path+0x7b/0x130 Call Trace: <TASK> dentry_path_raw+0x42/0x70 kvm_uevent_notify_change.part.0+0x10c/0x200 [kvm] kvm_put_kvm+0x63/0x2b0 [kvm] kvm_dev_ioctl+0x43a/0x920 [kvm] __x64_sys_ioctl+0x83/0xb0 do_syscall_64+0x31/0x50 entry_SYSCALL_64_after_hwframe+0x44/0xae </TASK> Modules linked in: kvm_intel kvm irqbypass Fixes: a44a4cc1c969 ("KVM: Don't create VM debugfs files outside of the VM directory") Cc: stable@vger.kernel.org Cc: Marc Zyngier <maz@kernel.org> Cc: Oliver Upton <oupton@google.com> Reported-by: syzbot+df6fbbd2ee39f21289ef@syzkaller.appspotmail.com Signed-off-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Oliver Upton <oupton@google.com> Message-Id: <20220415004622.2207751-1-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-04-21KVM: Add helpers to wrap vcpu->srcu_idx and yell if it's abusedSean Christopherson11-50/+71
Add wrappers to acquire/release KVM's SRCU lock when stashing the index in vcpu->src_idx, along with rudimentary detection of illegal usage, e.g. re-acquiring SRCU and thus overwriting vcpu->src_idx. Because the SRCU index is (currently) either 0 or 1, illegal nesting bugs can go unnoticed for quite some time and only cause problems when the nested lock happens to get a different index. Wrap the WARNs in PROVE_RCU=y, and make them ONCE, otherwise KVM will likely yell so loudly that it will bring the kernel to its knees. Signed-off-by: Sean Christopherson <seanjc@google.com> Tested-by: Fabiano Rosas <farosas@linux.ibm.com> Message-Id: <20220415004343.2203171-4-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-04-21KVM: RISC-V: Use kvm_vcpu.srcu_idx, drop RISC-V's unnecessary copySean Christopherson3-13/+10
Use the generic kvm_vcpu's srcu_idx instead of using an indentical field in RISC-V's version of kvm_vcpu_arch. Generic KVM very intentionally does not touch vcpu->srcu_idx, i.e. there's zero chance of running afoul of common code. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20220415004343.2203171-3-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-04-21KVM: x86: Don't re-acquire SRCU lock in complete_emulated_io()Sean Christopherson1-6/+1
Don't re-acquire SRCU in complete_emulated_io() now that KVM acquires the lock in kvm_arch_vcpu_ioctl_run(). More importantly, don't overwrite vcpu->srcu_idx. If the index acquired by complete_emulated_io() differs from the one acquired by kvm_arch_vcpu_ioctl_run(), KVM will effectively leak a lock and hang if/when synchronize_srcu() is invoked for the relevant grace period. Fixes: 8d25b7beca7e ("KVM: x86: pull kvm->srcu read-side to kvm_arch_vcpu_ioctl_run") Cc: stable@vger.kernel.org Signed-off-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com> Message-Id: <20220415004343.2203171-2-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-04-21Merge tag 'kvm-riscv-fixes-5.18-2' of https://github.com/kvm-riscv/linux ↵Paolo Bonzini1-9/+12
into HEAD KVM/riscv fixes for 5.18, take #2 - Remove 's' & 'u' as valid ISA extension - Do not allow disabling the base extensions 'i'/'m'/'a'/'c'