path: root/include
Age | Commit message | Author | Files | Lines
2023-08-21 | pgtable: create struct ptdesc | Vishal Moola (Oracle) | 1 file, +70/-0
Currently, page table information is stored within struct page. As part of simplifying struct page, create struct ptdesc for page table information. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Vishal Moola (Oracle) <[email protected]> Acked-by: Mike Rapoport (IBM) <[email protected]> Cc: Arnd Bergmann <[email protected]> Cc: Catalin Marinas <[email protected]> Cc: Christophe Leroy <[email protected]> Cc: Claudio Imbrenda <[email protected]> Cc: Dave Hansen <[email protected]> Cc: David Hildenbrand <[email protected]> Cc: "David S. Miller" <[email protected]> Cc: Dinh Nguyen <[email protected]> Cc: Geert Uytterhoeven <[email protected]> Cc: Geert Uytterhoeven <[email protected]> Cc: Guo Ren <[email protected]> Cc: Huacai Chen <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: John Paul Adrian Glaubitz <[email protected]> Cc: Jonas Bonn <[email protected]> Cc: Matthew Wilcox <[email protected]> Cc: Palmer Dabbelt <[email protected]> Cc: Paul Walmsley <[email protected]> Cc: Richard Weinberger <[email protected]> Cc: Thomas Bogendoerfer <[email protected]> Cc: Yoshinori Sato <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
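To make the split concrete, here is an abridged sketch of a page-table descriptor in the spirit of this series; the exact fields of the merged struct ptdesc may differ, and the overlay comments are illustrative:

struct ptdesc {
        unsigned long __page_flags;             /* overlays page->flags */

        union {
                struct rcu_head pt_rcu_head;    /* deferred table freeing */
                struct list_head pt_list;       /* arch-private table lists */
                struct {
                        unsigned long _pt_pad_1;
                        pmd_t *pmd_huge_pte;    /* THP deposit/withdraw */
                };
        };

        union {
                struct mm_struct *pt_mm;        /* owning mm (e.g. x86 pgd) */
                atomic_t pt_frag_refcount;      /* page-table fragment users */
        };

        union {
                unsigned long _pt_pad_2;
                spinlock_t ptl;                 /* split page-table lock */
        };
};

Each field overlays the corresponding struct page field for now; the point of the descriptor is that page-table code stops reaching into struct page directly, which is what later enables allocating these descriptors independently.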
2023-08-21 | mm: add PAGE_TYPE_OP folio functions | Vishal Moola (Oracle) | 1 file, +23/-7
Patch series "Split ptdesc from struct page", v9. The MM subsystem is trying to shrink struct page. This patchset introduces a memory descriptor for page table tracking - struct ptdesc. This patchset introduces ptdesc, splits ptdesc from struct page, and converts many callers of page table constructors/destructors to use ptdescs. Ptdesc is a foundation to further standardize page tables, and eventually allow for dynamic allocation of page tables independent of struct page. However, the use of pages for page table tracking is quite deeply ingrained and varied across architectures, so there is still a lot of work to be done before that can happen. This patch (of 31): No folio equivalents for page type operations have been defined, so define them for later folio conversions. Also changes the Page##uname macros to take in const struct page* since we only read the memory here. Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Vishal Moola (Oracle) <[email protected]> Acked-by: Mike Rapoport (IBM) <[email protected]> Cc: Arnd Bergmann <[email protected]> Cc: Catalin Marinas <[email protected]> Cc: Christophe Leroy <[email protected]> Cc: Claudio Imbrenda <[email protected]> Cc: Dave Hansen <[email protected]> Cc: David Hildenbrand <[email protected]> Cc: "David S. Miller" <[email protected]> Cc: Dinh Nguyen <[email protected]> Cc: Geert Uytterhoeven <[email protected]> Cc: Huacai Chen <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Jonas Bonn <[email protected]> Cc: Matthew Wilcox <[email protected]> Cc: Paul Walmsley <[email protected]> Cc: Richard Weinberger <[email protected]> Cc: Thomas Bogendoerfer <[email protected]> Cc: Yoshinori Sato <[email protected]> Cc: Geert Uytterhoeven <[email protected]> Cc: Guo Ren <[email protected]> Cc: John Paul Adrian Glaubitz <[email protected]> Cc: Palmer Dabbelt <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-08-21 | mm/memory_hotplug: embed vmem_altmap details in memory block | Aneesh Kumar K.V | 1 file, +2/-6
With memmap on memory, some architectures need more details w.r.t. the altmap, such as base_pfn, end_pfn, etc., to unmap vmemmap memory. Instead of computing them again when we remove a memory block, embed vmem_altmap details in struct memory_block if we are using the memmap-on-memory block feature. [[email protected]: fix error return code in add_memory_resource()] Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Aneesh Kumar K.V <[email protected]> Signed-off-by: Yang Yingliang <[email protected]> Acked-by: Michal Hocko <[email protected]> Acked-by: David Hildenbrand <[email protected]> Cc: Christophe Leroy <[email protected]> Cc: Michael Ellerman <[email protected]> Cc: Nicholas Piggin <[email protected]> Cc: Oscar Salvador <[email protected]> Cc: Vishal Verma <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-08-21 | mm/memory_hotplug: allow memmap on memory hotplug request to fallback | Aneesh Kumar K.V | 1 file, +2/-1
If not supported, fall back to not using memmap on memory. This avoids the need for callers to do the fallback. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Aneesh Kumar K.V <[email protected]> Acked-by: Michal Hocko <[email protected]> Acked-by: David Hildenbrand <[email protected]> Cc: Christophe Leroy <[email protected]> Cc: Michael Ellerman <[email protected]> Cc: Nicholas Piggin <[email protected]> Cc: Oscar Salvador <[email protected]> Cc: Vishal Verma <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-08-21 | mm: memtest: convert to memtest_report_meminfo() | Kefeng Wang | 1 file, +4/-6
It is better not to expose too many internal variables of memtest; add a helper memtest_report_meminfo() to show memtest results. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Kefeng Wang <[email protected]> Acked-by: Mike Rapoport (IBM) <[email protected]> Cc: Matthew Wilcox <[email protected]> Cc: Tomas Mudrunka <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
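A plausible shape for the new interface, assuming the usual CONFIG-guarded stub pattern so callers need no #ifdef of their own:

/* include/linux/memtest.h (sketch) */
#ifdef CONFIG_MEMTEST
void memtest_report_meminfo(struct seq_file *m);
#else
static inline void memtest_report_meminfo(struct seq_file *m)
{
        /* no-op when memtest is compiled out */
}
#endif

The memtest pass/bad-page counters then stay private to the memtest implementation, and /proc/meminfo simply calls the helper.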
2023-08-21 | mm: move dummy_vm_ops out of a header | Mateusz Guzik | 1 file, +3/-3
Otherwise the kernel ends up with multiple copies: $ nm vmlinux | grep dummy_vm_ops ffffffff81e4ea00 d dummy_vm_ops.2 ffffffff81e11760 d dummy_vm_ops.254 ffffffff81e406e0 d dummy_vm_ops.4 ffffffff81e3c780 d dummy_vm_ops.7 While here, prefix it with vma_. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Mateusz Guzik <[email protected]> Cc: Matthew Wilcox <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
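The fix follows the standard declare-once/define-once pattern; a sketch, where the .c file hosting the definition is an assumption:

/* include/linux/mm.h (sketch) */
extern const struct vm_operations_struct vma_dummy_vm_ops;

static inline void vma_init(struct vm_area_struct *vma, struct mm_struct *mm)
{
        memset(vma, 0, sizeof(*vma));
        vma->vm_mm = mm;
        vma->vm_ops = &vma_dummy_vm_ops;        /* one shared instance */
        INIT_LIST_HEAD(&vma->anon_vma_chain);
}

/* mm/init-mm.c (sketch): the single definition */
const struct vm_operations_struct vma_dummy_vm_ops = {};

A static object defined in a header gets a separate copy in every translation unit that uses it, which is exactly the duplication the nm output above shows.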
2023-08-21 | mm: lock vma explicitly before doing vm_flags_reset and vm_flags_reset_once | Suren Baghdasaryan | 1 file, +7/-3
Implicit vma locking inside vm_flags_reset() and vm_flags_reset_once() is not obvious and makes it hard to understand where vma locking is happening. Also, in some cases (like in dup_userfaultfd()) the vma should be locked earlier than the vm_flags modification. To make locking more visible, change these functions to assert that the vma write lock is taken and explicitly lock the vma beforehand. Fix userfaultfd functions which should lock the vma earlier. Link: https://lkml.kernel.org/r/[email protected] Suggested-by: Linus Torvalds <[email protected]> Signed-off-by: Suren Baghdasaryan <[email protected]> Cc: Jann Horn <[email protected]> Cc: Liam R. Howlett <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
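A sketch of the before/after shape implied by the changelog; treat the helper bodies as approximate:

/* before: locking hidden inside the helper */
static inline void vm_flags_reset(struct vm_area_struct *vma,
                                  vm_flags_t flags)
{
        vma_start_write(vma);           /* easy to miss at call sites */
        vm_flags_init(vma, flags);
}

/* after: the helper only asserts; callers lock explicitly */
static inline void vm_flags_reset(struct vm_area_struct *vma,
                                  vm_flags_t flags)
{
        vma_assert_write_locked(vma);
        vm_flags_init(vma, flags);
}

/* call site */
vma_start_write(vma);
vm_flags_reset(vma, new_flags);

The assertion keeps the old safety net while making the lock acquisition visible exactly where the flags are modified.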
2023-08-21 | mm: for !CONFIG_PER_VMA_LOCK equate write lock assertion for vma and mmap | Suren Baghdasaryan | 1 file, +2/-1
When CONFIG_PER_VMA_LOCK=n, vma_assert_write_locked() should be equivalent to mmap_assert_write_locked(). Link: https://lkml.kernel.org/r/[email protected] Suggested-by: Jann Horn <[email protected]> Signed-off-by: Suren Baghdasaryan <[email protected]> Reviewed-by: Liam R. Howlett <[email protected]> Cc: Linus Torvalds <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
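A sketch of the equivalence for the !CONFIG_PER_VMA_LOCK stub: without per-vma locks, the mmap_lock is the only write lock a vma can be protected by, so asserting on it is the correct fallback.

#ifndef CONFIG_PER_VMA_LOCK
static inline void vma_assert_write_locked(struct vm_area_struct *vma)
{
        mmap_assert_write_locked(vma->vm_mm);
}
#endif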
2023-08-21 | mm/damon/core: implement target type damos filter | SeongJae Park | 1 file, +6/-0
One DAMON context can have multiple monitoring targets, and DAMOS schemes are applied to all targets. In some cases, users need to apply different schemes to different targets. Retrieving monitoring results via the DAMON sysfs interface's 'tried_regions' directory could be one good example. Also, there could be cases where the cgroup DAMOS filter is not enough. All such use cases can be worked around by having multiple DAMON contexts each having only a single target, but it is inefficient in terms of resource usage, though the overhead is not estimated to be huge. Implement a DAMON monitoring target based DAMOS filter for the case. Like the address range target DAMOS filter, handle these filters in the DAMON core layer, since it is more efficient than doing so in the operations set layer. This also means that regions that are filtered out by monitoring target type DAMOS filters are counted as not tried by the scheme. Hence, target granularity monitoring results retrieval via the DAMON sysfs interface becomes available. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: SeongJae Park <[email protected]> Cc: Brendan Higgins <[email protected]> Cc: Jonathan Corbet <[email protected]> Cc: Shuah Khan <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-08-21 | mm/damon/core: introduce address range type damos filter | SeongJae Park | 1 file, +16/-6
Patch series "Extend DAMOS filters for address ranges and DAMON monitoring targets". There are use cases that need to apply DAMOS schemes to specific address ranges or DAMON monitoring targets. NUMA nodes in the physical address space, special memory objects in the virtual address space, and monitoring-target-specific efficient monitoring results snapshot retrieval could be examples of such use cases. This patchset extends the DAMOS filters feature for such cases by implementing two more filter types, namely address ranges and DAMON monitoring types. Patches sequence: The first seven patches are for the address ranges based DAMOS filter. The first patch implements the filter feature and exposes it via the DAMON kernel API. The second patch further exposes the feature to users via the DAMON sysfs interface. The third and fourth patches implement unit tests and selftests for the feature. Three patches (fifth to seventh) updating the documents follow. The following six patches are for the DAMON monitoring target based DAMOS filter. The eighth patch implements the feature in the core layer and exposes it via DAMON's kernel API. The ninth patch further exposes it to users via the DAMON sysfs interface. The tenth patch adds a selftest, and two patches (eleventh and twelfth) update documents. [1] https://lore.kernel.org/damon/[email protected]/ This patch (of 13): Users can know special characteristics of specific address ranges. NUMA nodes or special objects or buffers in the virtual address space could be such examples. For such cases, DAMOS schemes could be required to be applied to only specific address ranges. Implement yet another type of DAMOS filter for the purpose. Note that the existing filter types, namely the anon pages and memcg DAMOS filters, need page-level type checks. Because such checks can be done efficiently in the operations set layer, those filters are handled in the operations set layer. Specifically, only the paddr operations set implementation supports these filters. Also, because statistics counting is done in the DAMON core layer, the regions that are filtered out by these filters are counted in the statistics as tried but failed. Unlike those, address range based filters can be efficiently handled in the core layer. Hence, do the handling in that layer, and count the regions filtered out by them as not tried by the scheme. This difference should be clearly documented. Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: SeongJae Park <[email protected]> Cc: Brendan Higgins <[email protected]> Cc: Jonathan Corbet <[email protected]> Cc: Shuah Khan <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-08-21 | mm/memcg: update obsolete comment above parent_mem_cgroup() | Miaohe Lin | 1 file, +1/-2
Since commit bef8620cd8e0 ("mm: memcg: deprecate the non-hierarchical mode"), use_hierarchy is deprecated. It was further removed via commit 9d9d341df4d5 ("cgroup: remove obsoleted broken_hierarchy and warned_broken_hierarchy"). Update the corresponding comment. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Miaohe Lin <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Roman Gushchin <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Shakeel Butt <[email protected]> Cc: Muchun Song <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-08-21 | mm: factor out VMA stack and heap checks | Kefeng Wang | 1 file, +25/-0
Patch series "mm: convert to vma_is_initial_heap/stack()", v3. Add vma_is_initial_stack() and vma_is_initial_heap() helpers and use them to simplify code. This patch (of 4): Factor out VMA stack and heap checks and name them vma_is_initial_stack() and vma_is_initial_heap() for general use. Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Kefeng Wang <[email protected]> Reviewed-by: David Hildenbrand <[email protected]> Acked-by: Peter Zijlstra (Intel) <[email protected]> Cc: Christian Göttsche <[email protected]> Cc: Alex Deucher <[email protected]> Cc: Arnaldo Carvalho de Melo <[email protected]> Cc: Christian Göttsche <[email protected]> Cc: Christian König <[email protected]> Cc: Daniel Vetter <[email protected]> Cc: David Airlie <[email protected]> Cc: Eric Paris <[email protected]> Cc: Felix Kuehling <[email protected]> Cc: "Pan, Xinhui" <[email protected]> Cc: Paul Moore <[email protected]> Cc: Stephen Smalley <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-08-21 | mm/page_ext: move page_ext_operations definition under CONFIG_PAGE_EXTENSION | Kemeng Shi | 1 file, +1/-2
page_ext_operations should only be defined when CONFIG_PAGE_EXTENSION is enabled. Besides, this may catch a missing dependency on CONFIG_PAGE_EXTENSION from future Page Extension clients at compile time. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Kemeng Shi <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-08-21 | mm/memcg: fix obsolete function name in mem_cgroup_protection() | Miaohe Lin | 1 file, +1/-1
Commit 45c7f7e1ef17 ("mm, memcg: decouple e{low,min} state mutations from protection checks") changed the function name but not the corresponding comment. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Miaohe Lin <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Roman Gushchin <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Shakeel Butt <[email protected]> Cc: Muchun Song <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-08-21 | mm/page_ext: add common function to get client data from page_ext | Kemeng Shi | 1 file, +6/-0
Patch series "add page_ext_data to get client data in page_ext". Currently, clients get data from page_ext by adding an offset which is auto-generated in the page_ext core, which exposes the data layout design inside the page_ext core. This series adds a page_ext_data() helper to hide this from clients. Benefits include: 1. Future clients can call page_ext_data directly instead of defining a new function like get_page_owner to get the data. 2. There is no change to clients if the layout of page_ext data changes. This patch (of 3): Add a common page_ext_data function to get client data. This hides the offset, which is auto-generated in the page_ext core, and avoids exposing the design of the page_ext data layout. Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Kemeng Shi <[email protected]> Reviewed-by: Andrew Morton <[email protected]> Acked-by: Mike Rapoport (IBM) <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-08-21 | zswap: make zswap_load() take a folio | Matthew Wilcox (Oracle) | 1 file, +2/-2
Only convert a few easy parts of this function to use the folio passed in; convert back to struct page for the majority of it. Removes three hidden calls to compound_head(). Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Matthew Wilcox (Oracle) <[email protected]> Cc: Christoph Hellwig <[email protected]> Cc: Domenico Cerasuolo <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Nhat Pham <[email protected]> Cc: Vitaly Wool <[email protected]> Cc: Yosry Ahmed <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-08-21 | memcg: convert get_obj_cgroup_from_page to get_obj_cgroup_from_folio | Matthew Wilcox (Oracle) | 1 file, +2/-2
As the one caller now has a folio, pass it in and use it. Removes three calls to compound_head(). Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Matthew Wilcox (Oracle) <[email protected]> Cc: Christoph Hellwig <[email protected]> Cc: Domenico Cerasuolo <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Nhat Pham <[email protected]> Cc: Vitaly Wool <[email protected]> Cc: Yosry Ahmed <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-08-21 | zswap: make zswap_store() take a folio | Matthew Wilcox (Oracle) | 1 file, +2/-2
Patch series "Followup folio conversions for zswap". With frontswap killed, it's worth converting the zswap_load() and zswap_store() functions to take a folio instead of a page pointer. They aren't converted to support large folios, but there are a lot of unnecessary calls to compound_head() that are removed by these patches. This patch (of 4): Only convert a few easy parts of this function to use the folio passed in; convert back to struct page for the majority of it. This does remove a few hidden calls to compound_head(). Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Matthew Wilcox (Oracle) <[email protected]> Cc: Christoph Hellwig <[email protected]> Cc: Domenico Cerasuolo <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Matthew Wilcox (Oracle) <[email protected]> Cc: Nhat Pham <[email protected]> Cc: Vitaly Wool <[email protected]> Cc: Yosry Ahmed <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-08-21 | mm: kill frontswap | Johannes Weiner | 4 files, +37/-105
The only user of frontswap is zswap, and has been for a long time. Have swap call into zswap directly and remove the indirection. [[email protected]: remove obsolete comment, per Yosry] Link: https://lkml.kernel.org/r/[email protected] [[email protected]: don't warn if none swapcache folio is passed to zswap_load] Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Johannes Weiner <[email protected]> Signed-off-by: Yin Fengwei <[email protected]> Acked-by: Konrad Rzeszutek Wilk <[email protected]> Acked-by: Nhat Pham <[email protected]> Acked-by: Yosry Ahmed <[email protected]> Acked-by: Christoph Hellwig <[email protected]> Cc: Domenico Cerasuolo <[email protected]> Cc: Matthew Wilcox (Oracle) <[email protected]> Cc: Vitaly Wool <[email protected]> Cc: Vlastimil Babka <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-08-21 | scsi: libsas: Remove unused declarations | Yue Haibing | 1 file, +0/-2
Commit 042ebd293b86 ("scsi: libsas: kill useless ha_event and do some cleanup") removed sas_hae_reset() but not its declaration. Commit 2908d778ab3e ("[SCSI] aic94xx: new driver") declared but never implemented other functions. Signed-off-by: Yue Haibing <[email protected]> Link: https://lore.kernel.org/r/[email protected] Reviewed-by: John Garry <[email protected]> Reviewed-by: Jason Yan <[email protected]> Signed-off-by: Martin K. Petersen <[email protected]>
2023-08-21 | mm: enable page walking API to lock vmas during the walk | Suren Baghdasaryan | 1 file, +11/-0
walk_page_range() and friends often operate under write-locked mmap_lock. With the introduction of vma locks, the vmas have to be locked as well during such walks to prevent concurrent page faults in these areas. Add an additional member to mm_walk_ops to indicate locking requirements for the walk. The change ensures that page walks which prevent concurrent page faults by write-locking mmap_lock operate correctly after the introduction of per-vma locks. With per-vma locks, page faults can be handled under the vma lock without taking mmap_lock at all, so write-locking mmap_lock would not stop them. The change ensures vmas are properly locked during such walks. A sample issue this solves is do_mbind() performing queue_pages_range() to queue pages for migration. Without this change, a page can be concurrently faulted into the area and be left out of migration. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Suren Baghdasaryan <[email protected]> Suggested-by: Linus Torvalds <[email protected]> Suggested-by: Jann Horn <[email protected]> Cc: David Hildenbrand <[email protected]> Cc: Davidlohr Bueso <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Laurent Dufour <[email protected]> Cc: Liam Howlett <[email protected]> Cc: Matthew Wilcox (Oracle) <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Michel Lespinasse <[email protected]> Cc: Peter Xu <[email protected]> Cc: Vlastimil Babka <[email protected]> Cc: <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
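A sketch of the new member, with names following the locking requirements spelled out above:

enum page_walk_lock {
        /* mmap_lock is held for read; the vma tree is stable */
        PGWALK_RDLOCK = 0,
        /* each vma will be write-locked during the walk */
        PGWALK_WRLOCK = 1,
        /* each vma is expected to be already write-locked */
        PGWALK_WRLOCK_VERIFY = 2,
};

struct mm_walk_ops {
        /* ... existing pgd/pmd/pte/hugetlb entry callbacks ... */
        enum page_walk_lock walk_lock;
};

Walkers that mutate mappings (such as the mbind path above) would pick a write-locking mode, so a concurrent per-vma-lock fault cannot slip a page in behind the walk.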
2023-08-21 | smaps: use vm_normal_page_pmd() instead of follow_trans_huge_pmd() | David Hildenbrand | 1 file, +0/-3
We shouldn't be using a GUP-internal helper if it can be avoided. Similar to smaps_pte_entry() that uses vm_normal_page(), let's use vm_normal_page_pmd() that similarly refuses to return the huge zeropage. In contrast to follow_trans_huge_pmd(), vm_normal_page_pmd(): (1) Will always return the head page, not a tail page of a THP. If we'd ever call smaps_account with a tail page while setting "compound = true", we could be in trouble, because smaps_account() would look at the memmap of unrelated pages. If we're unlucky, that memmap does not exist at all. Before we removed PG_doublemap, we could have triggered something similar as in commit 24d7275ce279 ("fs/proc: task_mmu.c: don't read mapcount for migration entry"). This can theoretically happen ever since commit ff9f47f6f00c ("mm: proc: smaps_rollup: do not stall write attempts on mmap_lock"): (a) We're in show_smaps_rollup() and processed a VMA (b) We release the mmap lock in show_smaps_rollup() because it is contended (c) We merged that VMA with another VMA (d) We collapsed a THP in that merged VMA at that position If the end address of the original VMA falls into the middle of a THP area, we would call smap_gather_stats() with a start address that falls into a PMD-mapped THP. It's probably very rare to trigger when not really forced. (2) Will succeed on an is_pci_p2pdma_page(), like vm_normal_page() Treat such PMDs here just like smaps_pte_entry() would treat such PTEs. If such pages were anonymous, we would most certainly want to account them. (3) Will skip over pmd_devmap(), like vm_normal_page() for pte_devmap() As noted in vm_normal_page(), that is only for handling legacy ZONE_DEVICE pages. So just like smaps_pte_entry(), we'll now also ignore such PMD entries. Especially, follow_pmd_mask() never ends up calling follow_trans_huge_pmd() on pmd_devmap(). Instead it calls follow_devmap_pmd() -- which will fail if neither FOLL_GET nor FOLL_PIN is set. So skipping pmd_devmap() pages seems to be the right thing to do. (4) Will properly handle VM_MIXEDMAP/VM_PFNMAP, like vm_normal_page() We won't be returning a memmap that should be ignored by core-mm, or worse, a memmap that does not even exist. Note that while walk_page_range() will skip VM_PFNMAP mappings, walk_page_vma() won't. Most probably this case doesn't currently really happen on the PMD level, otherwise we'd already be able to trigger kernel crashes when reading smaps / smaps_rollup. So most probably only (1) is relevant in practice as of now, but could only cause trouble in extreme corner cases. Let's move follow_trans_huge_pmd() to mm/internal.h to discourage future reuse in the wrong context. Link: https://lkml.kernel.org/r/[email protected] Fixes: ff9f47f6f00c ("mm: proc: smaps_rollup: do not stall write attempts on mmap_lock") Signed-off-by: David Hildenbrand <[email protected]> Acked-by: Mel Gorman <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Jason Gunthorpe <[email protected]> Cc: John Hubbard <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: liubo <[email protected]> Cc: Matthew Wilcox (Oracle) <[email protected]> Cc: Mel Gorman <[email protected]> Cc: Paolo Bonzini <[email protected]> Cc: Peter Xu <[email protected]> Cc: Shuah Khan <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-08-21 | mm/gup: reintroduce FOLL_NUMA as FOLL_HONOR_NUMA_FAULT | David Hildenbrand | 2 files, +24/-6
Unfortunately commit 474098edac26 ("mm/gup: replace FOLL_NUMA by gup_can_follow_protnone()") missed that follow_page() and follow_trans_huge_pmd() never implicitly set FOLL_NUMA because they really don't want to fail on PROT_NONE-mapped pages -- either due to NUMA hinting or due to inaccessible (PROT_NONE) VMAs. As spelled out in commit 0b9d705297b2 ("mm: numa: Support NUMA hinting page faults from gup/gup_fast"): "Other follow_page callers like KSM should not use FOLL_NUMA, or they would fail to get the pages if they use follow_page instead of get_user_pages." liubo reported [1] that smaps_rollup results are imprecise, because they miss accounting of pages that are mapped PROT_NONE. Further, it's easy to reproduce that KSM no longer works on inaccessible VMAs on x86-64, because pte_protnone()/pmd_protnone() also indicates "true" in inaccessible VMAs, and follow_page() refuses to return such pages right now. As KVM really depends on these NUMA hinting faults, removing the pte_protnone()/pmd_protnone() handling in GUP code completely is not really an option. To fix the issues at hand, let's revive FOLL_NUMA as FOLL_HONOR_NUMA_FAULT to restore the original behavior for now and add better comments. Set FOLL_HONOR_NUMA_FAULT independent of FOLL_FORCE in is_valid_gup_args(), to add that flag for all external GUP users. Note that there are three GUP-internal __get_user_pages() users that don't end up calling is_valid_gup_args() and consequently won't get FOLL_HONOR_NUMA_FAULT set. 1) get_dump_page(): we really don't want to handle NUMA hinting faults. It specifies FOLL_FORCE and wouldn't have honored NUMA hinting faults already. 2) populate_vma_page_range(): we really don't want to handle NUMA hinting faults. It specifies FOLL_FORCE on accessible VMAs, so it wouldn't have honored NUMA hinting faults already. 3) faultin_vma_page_range(): we similarly don't want to handle NUMA hinting faults. To make the combination of FOLL_FORCE and FOLL_HONOR_NUMA_FAULT work in inaccessible VMAs properly, we have to perform VMA accessibility checks in gup_can_follow_protnone(). As GUP-fast should reject such pages either way in pte_access_permitted()/pmd_access_permitted() -- for example on x86-64 and arm64 that both implement pte_protnone() -- let's just always fall back to ordinary GUP when stumbling over pte_protnone()/pmd_protnone(). As Linus notes [2], honoring NUMA faults might only make sense for selected GUP users. So we should really see if we can instead let relevant GUP callers specify it manually, and not trigger NUMA hinting faults from GUP as default. Prepare for that by making FOLL_HONOR_NUMA_FAULT an external GUP flag and adding appropriate documentation. While at it, remove a stale comment from follow_trans_huge_pmd(): That comment for pmd_protnone() was added in commit 2b4847e73004 ("mm: numa: serialise parallel get_user_page against THP migration"), which noted: THP does not unmap pages due to a lack of support for migration entries at a PMD level. This allows races with get_user_pages Nowadays, we do have PMD migration entries, so the comment no longer applies. Let's drop it.
[1] https://lore.kernel.org/r/[email protected] [2] https://lore.kernel.org/r/CAHk-=wgRiP_9X0rRdZKT8nhemZGNateMtb366t37d8-x7VRs=g@mail.gmail.com Link: https://lkml.kernel.org/r/[email protected] Fixes: 474098edac26 ("mm/gup: replace FOLL_NUMA by gup_can_follow_protnone()") Signed-off-by: David Hildenbrand <[email protected]> Reported-by: liubo <[email protected]> Closes: https://lore.kernel.org/r/[email protected] Reported-by: Peter Xu <[email protected]> Closes: https://lore.kernel.org/all/ZMKJjDaqZ7FW0jfe@x1n/ Acked-by: Mel Gorman <[email protected]> Acked-by: Peter Xu <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Jason Gunthorpe <[email protected]> Cc: John Hubbard <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Matthew Wilcox (Oracle) <[email protected]> Cc: Mel Gorman <[email protected]> Cc: Paolo Bonzini <[email protected]> Cc: Shuah Khan <[email protected]> Cc: <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
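A sketch of the resulting check, assuming the helper is given the VMA so accessibility can be honored:

static inline bool gup_can_follow_protnone(struct vm_area_struct *vma,
                                           unsigned int flags)
{
        /* Callers not honoring NUMA faults may follow protnone PTEs. */
        if (!(flags & FOLL_HONOR_NUMA_FAULT))
                return true;

        /*
         * In inaccessible (PROT_NONE) VMAs, pte_protnone()/pmd_protnone()
         * does not indicate a pending NUMA hinting fault, so don't fail
         * there either.
         */
        return !vma_is_accessible(vma);
}

This restores the pre-474098edac26 behavior for follow_page() users such as KSM and smaps, while keeping the NUMA hinting faults that KVM relies on for external GUP callers.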
2023-08-21 | nsproxy: Convert nsproxy.count to refcount_t | Elena Reshetova | 1 file, +3/-4
atomic_t variables are currently used to implement reference counters with the following properties: - counter is initialized to 1 using atomic_set() - a resource is freed upon counter reaching zero - once counter reaches zero, its further increments aren't allowed - counter schema uses basic atomic operations (set, inc, inc_not_zero, dec_and_test, etc.) Such atomic variables should be converted to a newly provided refcount_t type and API that prevents accidental counter overflows and underflows. This is important since overflows and underflows can lead to a use-after-free situation and be exploitable. The variable nsproxy.count is used as a pure reference counter. Convert it to refcount_t and fix up the operations. Important note for maintainers: Some functions from the refcount_t API defined in refcount.h have different memory ordering guarantees than their atomic counterparts. Please check Documentation/core-api/refcount-vs-atomic.rst for more information. Normally the differences should not matter since refcount_t provides enough guarantees to satisfy the refcounting use cases, but in some rare cases it might matter. Please double check that you don't have some undocumented memory guarantees for this variable usage. For the nsproxy.count it might make a difference in following places: - put_nsproxy() and switch_task_namespaces(): decrement in refcount_dec_and_test() only provides RELEASE ordering and ACQUIRE ordering on success vs. fully ordered atomic counterpart Suggested-by: Kees Cook <[email protected]> Signed-off-by: Elena Reshetova <[email protected]> Reviewed-by: David Windsor <[email protected]> Reviewed-by: Hans Liljestrand <[email protected]> Reviewed-by: Christian Brauner <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Kees Cook <[email protected]>
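The conversion follows the standard atomic_t-to-refcount_t recipe; a minimal sketch of the touched operations:

#include <linux/refcount.h>

struct nsproxy {
        refcount_t count;       /* was: atomic_t count */
        /* ... namespace pointers ... */
};

static inline void get_nsproxy(struct nsproxy *ns)
{
        refcount_inc(&ns->count);       /* saturates instead of overflowing */
}

static inline void put_nsproxy(struct nsproxy *ns)
{
        /* RELEASE ordering, plus ACQUIRE on success (see note above) */
        if (refcount_dec_and_test(&ns->count))
                free_nsproxy(ns);
}

refcount_t saturates at a large value rather than wrapping to zero, turning a potential use-after-free into a (noisy) leak.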
2023-08-21 | ACPI: x86: s2idle: Add a function to get LPS0 constraint for a device | Mario Limonciello | 1 file, +6/-0
Other parts of the kernel may use LPS0 constraints information to make decisions on what power state to put a device into. Signed-off-by: Mario Limonciello <[email protected]> [ rjw: Rewrite kerneldoc, rearrange if () statement, edit subject and changelog ] Signed-off-by: Rafael J. Wysocki <[email protected]>
2023-08-21 | ACPI: Adjust #ifdef for *_lps0_dev use | Mario Limonciello | 1 file, +2/-2
The `#ifdef` for acpi_register_lps0_dev() is currently guarded by `CONFIG_X86`, but the functions contained in the block are specifically sleep-related functions. Adjust the guard to also check for `CONFIG_SUSPEND`. Signed-off-by: Mario Limonciello <[email protected]> Signed-off-by: Rafael J. Wysocki <[email protected]>
2023-08-21 | super: wait until we passed kill super | Christian Brauner | 1 file, +1/-0
Recent rework moved block device closing out of sb->put_super() and into sb->kill_sb() to avoid deadlocks, as s_umount is held in put_super() and blkdev_put() can end up taking s_umount again. That means we need to move the removal of the superblock from @fs_supers out of generic_shutdown_super() and into deactivate_locked_super() to ensure that concurrent mounters don't fail to open block devices that are still in use because blkdev_put() in sb->kill_sb() hasn't been called yet. We can now do this as we can make iterators through @fs_supers and @super_blocks wait without holding s_umount. Concurrent mounts will wait until a dying superblock is fully dead, i.e. until sb->kill_sb() has been called and SB_DEAD has been set. Concurrent iterators can already discard any SB_DYING superblock. Reviewed-by: Jan Kara <[email protected]> Message-Id: <[email protected]> Signed-off-by: Christian Brauner <[email protected]>
2023-08-21 | super: wait for nascent superblocks | Christian Brauner | 1 file, +1/-0
Recent patches experiment with making it possible to allocate a new superblock before opening the relevant block device. Naturally this has intricate side-effects that we get to learn about while developing this. Superblock allocators such as sget{_fc}() return with s_umount of the new superblock held, and lock ordering currently requires that block level locks such as bdev_lock and open_mutex rank above s_umount. Before aca740cecbe5 ("fs: open block device after superblock creation"), ordering was guaranteed to be correct as block devices were opened prior to superblock allocation and thus s_umount wasn't held. But now s_umount must be dropped before opening block devices to avoid locking violations. This has consequences. The main one is that iterators over @super_blocks and @fs_supers that grab a temporary reference to the superblock can now also grab s_umount before the caller has managed to open block devices and called fill_super(). So whereas before such iterators or concurrent mounts would have simply slept on s_umount until SB_BORN was set or the superblock was discarded due to initialization failure, they can now needlessly spin through sget{_fc}(). If the caller is sleeping on bdev_lock or open_mutex, one caller waiting on SB_BORN will always spin somewhere and potentially this can go on for quite a while. It should be possible to drop s_umount while allowing iterators to wait on a nascent superblock to either be born or discarded. This patch implements a wait_var_event() mechanism allowing iterators to sleep until they are woken when the superblock is born or discarded. This also allows us to avoid relooping through @fs_supers and @super_blocks if a superblock isn't yet born or dying. Link: aca740cecbe5 ("fs: open block device after superblock creation") Reviewed-by: Jan Kara <[email protected]> Message-Id: <[email protected]> Signed-off-by: Christian Brauner <[email protected]>
2023-08-21 | efi/runtime-wrappers: Use type safe encapsulation of call arguments | Ard Biesheuvel | 1 file, +8/-10
The current code that marshals the EFI runtime call arguments to hand them off to an async helper does so in a type-unsafe and slightly messy manner - everything is cast to void* except for some integral types that are passed by reference and dereferenced on the receiver end. Let's clean this up a bit, and record the arguments of each runtime service invocation exactly as they are issued, in a manner that permits the compiler to check the types of the arguments at both ends. Signed-off-by: Ard Biesheuvel <[email protected]>
2023-08-21 | pm: Introduce DEFINE_NOIRQ_DEV_PM_OPS() helper | Andy Shevchenko | 1 file, +9/-0
_DEFINE_DEV_PM_OPS() helps to define PM operations for the system sleep and/or runtime PM cases. Some of the existing users want to have the _noirq() variants set. For that purpose, introduce a new helper which sets up the _noirq() callbacks and provides the struct dev_pm_ops. Acked-by: "Rafael J. Wysocki" <[email protected]> Reviewed-by: Jonathan Cameron <[email protected]> Reviewed-by: Paul Cercueil <[email protected]> Reviewed-by: Geert Uytterhoeven <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Andy Shevchenko <[email protected]>
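A usage sketch for a driver; the device and callback names are hypothetical:

static int mydev_suspend_noirq(struct device *dev)
{
        /* quiesce hardware after interrupts have been disabled */
        return 0;
}

static int mydev_resume_noirq(struct device *dev)
{
        /* reinitialize hardware before interrupts are re-enabled */
        return 0;
}

/* Populates the _noirq phase callbacks of a struct dev_pm_ops. */
static DEFINE_NOIRQ_DEV_PM_OPS(mydev_pm_ops, mydev_suspend_noirq,
                               mydev_resume_noirq);

static struct platform_driver mydev_driver = {
        .driver = {
                .name = "mydev",
                .pm = pm_sleep_ptr(&mydev_pm_ops),
        },
};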
2023-08-21 | fs: create kiocb_{start,end}_write() helpers | Amir Goldstein | 1 file, +36/-0
aio, io_uring, cachefiles and overlayfs all open-code an ugly variant of file_{start,end}_write() to silence lockdep warnings. Create helpers for this lockdep dance so we can use the helpers in all the callers. Suggested-by: Jan Kara <[email protected]> Signed-off-by: Amir Goldstein <[email protected]> Reviewed-by: Jan Kara <[email protected]> Reviewed-by: Jens Axboe <[email protected]> Message-Id: <[email protected]> Signed-off-by: Christian Brauner <[email protected]>
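A sketch of the helpers encapsulating that lockdep dance, modeled on the open-coded variants the changelog mentions; treat the bodies as approximate:

static inline void kiocb_start_write(struct kiocb *iocb)
{
        struct inode *inode = file_inode(iocb->ki_filp);

        sb_start_write(inode->i_sb);
        /*
         * Fool lockdep by telling it the lock got released, so it does
         * not complain about freeze protection still held when the
         * async submitter returns to userspace.
         */
        __sb_writers_release(inode->i_sb, SB_FREEZE_WRITE);
}

static inline void kiocb_end_write(struct kiocb *iocb)
{
        struct inode *inode = file_inode(iocb->ki_filp);

        /* Tell lockdep we inherited freeze protection from the submitter. */
        __sb_writers_acquired(inode->i_sb, SB_FREEZE_WRITE);
        sb_end_write(inode->i_sb);
}

The asymmetry exists because the write is started on one task (the submitter) and completed on another (the completion context), which lockdep would otherwise flag as a lock held across a return to userspace.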
2023-08-21 | fs: add kerneldoc to file_{start,end}_write() helpers | Amir Goldstein | 1 file, +14/-1
Also, use sb_end_write() instead of an open-coded version. Signed-off-by: Amir Goldstein <[email protected]> Reviewed-by: Jan Kara <[email protected]> Reviewed-by: Jens Axboe <[email protected]> Message-Id: <[email protected]> Signed-off-by: Christian Brauner <[email protected]>
2023-08-21 | spi: sh-msiof: switch to use modern name | Yang Yingliang | 1 file, +2/-2
Change the legacy name master/slave to the modern name host/target. No functional change. Signed-off-by: Yang Yingliang <[email protected]> Reviewed-by: Geert Uytterhoeven <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Mark Brown <[email protected]>
2023-08-21 | spi: pxa2xx: switch to use modern name | Yang Yingliang | 1 file, +2/-2
Change the legacy name master/slave to the modern name host/target or controller. No functional change. Signed-off-by: Yang Yingliang <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Mark Brown <[email protected]>
2023-08-21 | btrfs: remove v0 extent handling | Qu Wenruo | 2 files, +5/-2
The v0 extent item has been deprecated for a long time, and we don't have any reports from the community either. So it's time to remove the v0 extent specific error handling, and just treat them as regular extent tree corruption. This patch would remove the btrfs_print_v0_err() helper, and enhance the involved error handling to treat them just as any extent tree corruption. No reports regarding v0 extents have been seen since the graceful handling was added in 2018. This involves: - btrfs_backref_add_tree_node() This change is a little tricky: the new code is changed to only handle BTRFS_TREE_BLOCK_REF_KEY and BTRFS_SHARED_BLOCK_REF_KEY. But this is safe, as we have rejected any unknown inline refs through btrfs_get_extent_inline_ref_type(). For keyed backrefs, we're safe to skip anything we don't know (that's if it can pass tree-checker in the first place). - btrfs_lookup_extent_info() - lookup_inline_extent_backref() - run_delayed_extent_op() - __btrfs_free_extent() - add_tree_block() Regular error handling of unexpected extent tree items, and abort the transaction (if we have a trans handle). - remove_extent_data_ref() It's pretty much the same as the regular rejection of an unknown backref key. But for this particular case, we can also remove a BUG_ON(). - extent_data_ref_count() We can remove the BTRFS_EXTENT_REF_V0_KEY BUG_ON(), as it would be rejected by the only caller. - btrfs_print_leaf() Remove the handling for BTRFS_EXTENT_REF_V0_KEY. Signed-off-by: Qu Wenruo <[email protected]> Reviewed-by: David Sterba <[email protected]> Signed-off-by: David Sterba <[email protected]>
2023-08-21 | mm: remove folio_account_redirty | Christoph Hellwig | 1 file, +0/-5
Fold folio_account_redirty into folio_redirty_for_writepage now that all other users are gone, except for the account_page_redirty wrapper, which is also unused. Reviewed-by: Josef Bacik <[email protected]> Reviewed-by: Matthew Wilcox (Oracle) <[email protected]> Signed-off-by: Christoph Hellwig <[email protected]> Reviewed-by: David Sterba <[email protected]> Signed-off-by: David Sterba <[email protected]>
2023-08-21 | btrfs: tracepoints: simplify raid56 events | Qu Wenruo | 1 file, +2/-27
After commit 6bfd0133bee2 ("btrfs: raid56: switch scrub path to use a single function"), the raid56 implementation no longer uses different endio functions for RMW/recover/scrub. All read operations end in submit_read_wait_bio_list(), while all write operations end in submit_write_bios(). This means quite a few trace events are out-of-date and no longer used. This patch would unify the trace events into just two: - trace_raid56_read() Replaces trace_raid56_read_partial(), trace_raid56_scrub_read() and trace_raid56_scrub_read_recover(). - trace_raid56_write() Replaces trace_raid56_write_stripe() and trace_raid56_scrub_write_stripe(). Signed-off-by: Qu Wenruo <[email protected]> Reviewed-by: David Sterba <[email protected]> Signed-off-by: David Sterba <[email protected]>
2023-08-21 | fs: remove get_super | Christoph Hellwig | 1 file, +0/-1
get_super is unused now; remove it. Signed-off-by: Christoph Hellwig <[email protected]> Reviewed-by: Christian Brauner <[email protected]> Reviewed-by: Josef Bacik <[email protected]> Message-Id: <[email protected]> Signed-off-by: Christian Brauner <[email protected]>
2023-08-21 | block: call into the file system for ioctl BLKFLSBUF | Christoph Hellwig | 1 file, +5/-2
BLKFLSBUF is a historic ioctl that is called on a file handle to a block device and syncs either the file system mounted on that block device if there is one, or otherwise just the data on the block device. Replace the get_super based syncing with a holder operation to remove the last usage of get_super, and to also support syncing the file system if the block device is not the main block device stored in s_dev. Signed-off-by: Christoph Hellwig <[email protected]> Reviewed-by: Josef Bacik <[email protected]> Message-Id: <[email protected]> Signed-off-by: Christian Brauner <[email protected]>
2023-08-21 | block: call into the file system for bdev_mark_dead | Christoph Hellwig | 1 file, +1/-1
Combine the newly merged bdev_mark_dead helper with the existing mark_dead holder operation so that all operations that invalidate a device that is dead or being removed now go through the holder ops. This allows file systems to explicitly shut down either ASAP (for a surprise removal) or after writing back data (for an orderly removal), and do so not only for the main device. Signed-off-by: Christoph Hellwig <[email protected]> Reviewed-by: Josef Bacik <[email protected]> Message-Id: <[email protected]> Signed-off-by: Christian Brauner <[email protected]>
2023-08-21 | block: consolidate __invalidate_device and fsync_bdev | Christoph Hellwig | 1 file, +1/-1
We currently have two interfaces that take a block_device and then find a mounted file system to flush or invalidate data on it. Both are a bit problematic because they only work for the "main" block device that is used as s_dev for the super_block, and because they don't call into the file system at all. Merge the two into a new bdev_mark_dead helper that does both the syncing and invalidation and which is properly documented. This is in preparation for merging the functionality into the ->mark_dead holder operation so that it will work on additional block devices used by a file system and give us a single entry point for invalidation of dead devices or media. Note that a single standalone fsync_bdev call for an obscure ioctl remains for now, but that one will also be dealt with in a bit. Signed-off-by: Christoph Hellwig <[email protected]> Reviewed-by: Josef Bacik <[email protected]> Message-Id: <[email protected]> Signed-off-by: Christian Brauner <[email protected]>
2023-08-21 | block: simplify the disk_force_media_change interface | Christoph Hellwig | 1 file, +1/-1
Hard code the events to DISK_EVENT_MEDIA_CHANGE as that is the only useful use case, and drop the superfluous return value. Signed-off-by: Christoph Hellwig <[email protected]> Reviewed-by: Josef Bacik <[email protected]> Message-Id: <[email protected]> Signed-off-by: Christian Brauner <[email protected]>
2023-08-21 | Merge branches 'apple/dart', 'arm/mediatek', 'arm/renesas', 'arm/rockchip', 'arm/smmu', 'unisoc', 'x86/vt-d', 'x86/amd' and 'core' into next | Joerg Roedel | 5 files, +505/-3
2023-08-21 | drm/bridge: Fix kernel-doc typo in desc of output_bus_cfg in drm_bridge_state | Douglas Anderson | 1 file, +1/-1
There's an obvious copy-paste error in the description of output_bus_cfg. Fix it. Fixes: f32df58acc68 ("drm/bridge: Add the necessary bits to support bus format negotiation") Signed-off-by: Douglas Anderson <[email protected]> Signed-off-by: Maxime Ripard <[email protected]> Link: https://patchwork.freedesktop.org/patch/msgid/20230817094808.1.I41b04c3a8305c9f1c17af886c327941c5136ca3b@changeid
2023-08-21 | wifi: mac80211: limit reorder_buf_filtered to avoid UBSAN warning | Ping-Ke Shih | 1 file, +1/-0
The commit 06470f7468c8 ("mac80211: add API to allow filtering frames in BA sessions") added reorder_buf_filtered to mark frames filtered by firmware, and it can only work correctly if hw.max_rx_aggregation_subframes <= 64, since it stores the bitmap in a u64 variable. However, new HE or EHT devices can support BlockAck numbers up to 256 or 1024, and using a higher subframe index then leads to a UBSAN warning: UBSAN: shift-out-of-bounds in net/mac80211/rx.c:1129:39 shift exponent 215 is too large for 64-bit type 'long long unsigned int' Call Trace: <IRQ> dump_stack_lvl+0x48/0x70 dump_stack+0x10/0x20 __ubsan_handle_shift_out_of_bounds+0x1ac/0x360 ieee80211_release_reorder_frame.constprop.0.cold+0x64/0x69 [mac80211] ieee80211_sta_reorder_release+0x9c/0x400 [mac80211] ieee80211_prepare_and_rx_handle+0x1234/0x1420 [mac80211] ieee80211_rx_list+0xaef/0xf60 [mac80211] ieee80211_rx_napi+0x53/0xd0 [mac80211] Since only old hardware that supports <=64 BlockAck uses ieee80211_mark_rx_ba_filtered_frames(), limit its use as is, and add a WARN_ONCE() and a comment noting that this function must not be used if the hardware capability is not suitable. Signed-off-by: Ping-Ke Shih <[email protected]> Link: https://lore.kernel.org/r/[email protected] [edit commit message] Signed-off-by: Johannes Berg <[email protected]>
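A sketch of the guard at the top of ieee80211_mark_rx_ba_filtered_frames(); the warning text is illustrative:

/*
 * reorder_buf_filtered is a u64 bitmap, so this API cannot cover the
 * 256/1024-subframe BlockAck windows of HE/EHT hardware.
 */
if (WARN_ONCE(local->hw.max_rx_aggregation_subframes > 64,
              "RX BA marker can't support max_rx_aggregation_subframes %u > 64\n",
              local->hw.max_rx_aggregation_subframes))
        return;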
2023-08-21 | xen/evtchn: Remove unused function declaration xen_set_affinity_evtchn() | Yue Haibing | 1 file, +0/-1
Commit 67473b8194bc ("xen/events: Remove disfunct affinity spreading") left this declaration unused. Signed-off-by: Yue Haibing <[email protected]> Reviewed-by: Juergen Gross <[email protected]> Reviewed-by: Rahul Singh <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Juergen Gross <[email protected]>
2023-08-20 | ublk: zoned: support REQ_OP_ZONE_RESET_ALL | Ming Lei | 1 file, +1/-0
There isn't any reason not to support REQ_OP_ZONE_RESET_ALL given everything is actually handled in userspace, not to mention it is pretty easy to support RESET_ALL. So enable REQ_OP_ZONE_RESET_ALL and let userspace handle it. Verified by 'tools/zbc_reset_zone -all /dev/ublkb0' in libzbc[1] with a libublk-rs based ublk-zoned target prototype[2]; the following command line creates the ublk-zoned device: cargo run --example zoned -- add -1 1024 # add $dev_id $DEV_SIZE [1] https://github.com/westerndigitalcorporation/libzbc [2] https://github.com/ming1/libublk-rs/tree/zoned.v2 Cc: Niklas Cassel <[email protected]> Cc: Damien Le Moal <[email protected]> Cc: Andreas Hindborg <[email protected]> Signed-off-by: Ming Lei <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
2023-08-20 | net: selectively purge error queue in IP_RECVERR / IPV6_RECVERR | Eric Dumazet | 1 file, +1/-0
Setting the IP_RECVERR and IPV6_RECVERR options to zero currently purges the socket error queue, which was probably not expected for zerocopy and tx_timestamp users. I discovered this issue while preparing commit 6b5f43ea0815 ("inet: move inet->recverr to inet->inet_flags"); I presume this change does not need to be backported to stable kernels. Add a skb_errqueue_purge() helper to purge error messages only. Signed-off-by: Eric Dumazet <[email protected]> Cc: Willem de Bruijn <[email protected]> Cc: Soheil Hassas Yeganeh <[email protected]> Reviewed-by: Willem de Bruijn <[email protected]> Signed-off-by: David S. Miller <[email protected]>
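A sketch of the helper under that intent: purge queued error messages, but keep zerocopy and timestamp notifications. The exact skip conditions are an assumption:

void skb_errqueue_purge(struct sk_buff_head *list)
{
        struct sk_buff *skb, *next;
        unsigned long flags;

        spin_lock_irqsave(&list->lock, flags);
        skb_queue_walk_safe(list, skb, next) {
                /* keep zerocopy completions and tx timestamps */
                if (SKB_EXT_ERR(skb)->ee.ee_origin == SO_EE_ORIGIN_ZEROCOPY ||
                    SKB_EXT_ERR(skb)->ee.ee_origin == SO_EE_ORIGIN_TIMESTAMPING)
                        continue;
                __skb_unlink(skb, list);
                kfree_skb(skb);
        }
        spin_unlock_irqrestore(&list->lock, flags);
}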
2023-08-20 | Merge commit b320441c04c9 ("Merge tag 'tty-6.5-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty") into tty-next | Greg Kroah-Hartman | 31 files, +167/-127
We need the serial-core fixes in here as well. Signed-off-by: Greg Kroah-Hartman <[email protected]>
2023-08-20 | ipv4: fix data-races around inet->inet_id | Eric Dumazet | 2 files, +14/-3
UDP sendmsg() is lockless, so ip_select_ident_segs() can very well be run from multiple cpus [1]. Convert inet->inet_id to an atomic_t, but implement a dedicated path for TCP, avoiding the cost of a locked instruction (atomic_add_return()). Note that this patch will cause a trivial merge conflict because we added inet->flags in the net-next tree. v2: added missing change in drivers/net/ethernet/chelsio/inline_crypto/chtls/chtls_cm.c (David Ahern) [1] BUG: KCSAN: data-race in __ip_make_skb / __ip_make_skb read-write to 0xffff888145af952a of 2 bytes by task 7803 on cpu 1: ip_select_ident_segs include/net/ip.h:542 [inline] ip_select_ident include/net/ip.h:556 [inline] __ip_make_skb+0x844/0xc70 net/ipv4/ip_output.c:1446 ip_make_skb+0x233/0x2c0 net/ipv4/ip_output.c:1560 udp_sendmsg+0x1199/0x1250 net/ipv4/udp.c:1260 inet_sendmsg+0x63/0x80 net/ipv4/af_inet.c:830 sock_sendmsg_nosec net/socket.c:725 [inline] sock_sendmsg net/socket.c:748 [inline] ____sys_sendmsg+0x37c/0x4d0 net/socket.c:2494 ___sys_sendmsg net/socket.c:2548 [inline] __sys_sendmmsg+0x269/0x500 net/socket.c:2634 __do_sys_sendmmsg net/socket.c:2663 [inline] __se_sys_sendmmsg net/socket.c:2660 [inline] __x64_sys_sendmmsg+0x57/0x60 net/socket.c:2660 do_syscall_x64 arch/x86/entry/common.c:50 [inline] do_syscall_64+0x41/0xc0 arch/x86/entry/common.c:80 entry_SYSCALL_64_after_hwframe+0x63/0xcd read to 0xffff888145af952a of 2 bytes by task 7804 on cpu 0: ip_select_ident_segs include/net/ip.h:541 [inline] ip_select_ident include/net/ip.h:556 [inline] __ip_make_skb+0x817/0xc70 net/ipv4/ip_output.c:1446 ip_make_skb+0x233/0x2c0 net/ipv4/ip_output.c:1560 udp_sendmsg+0x1199/0x1250 net/ipv4/udp.c:1260 inet_sendmsg+0x63/0x80 net/ipv4/af_inet.c:830 sock_sendmsg_nosec net/socket.c:725 [inline] sock_sendmsg net/socket.c:748 [inline] ____sys_sendmsg+0x37c/0x4d0 net/socket.c:2494 ___sys_sendmsg net/socket.c:2548 [inline] __sys_sendmmsg+0x269/0x500 net/socket.c:2634 __do_sys_sendmmsg net/socket.c:2663 [inline] __se_sys_sendmmsg net/socket.c:2660 [inline] __x64_sys_sendmmsg+0x57/0x60 net/socket.c:2660 do_syscall_x64 arch/x86/entry/common.c:50 [inline] do_syscall_64+0x41/0xc0 arch/x86/entry/common.c:80 entry_SYSCALL_64_after_hwframe+0x63/0xcd value changed: 0x184d -> 0x184e Reported by Kernel Concurrency Sanitizer on: CPU: 0 PID: 7804 Comm: syz-executor.1 Not tainted 6.5.0-rc6-syzkaller #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 07/26/2023 ================================================================== Fixes: 23f57406b82d ("ipv4: avoid using shared IP generator for connected sockets") Reported-by: syzbot <[email protected]> Signed-off-by: Eric Dumazet <[email protected]> Reviewed-by: David Ahern <[email protected]> Signed-off-by: David S. Miller <[email protected]>
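A sketch of the dedicated TCP path in ip_select_ident_segs(), assuming inet_id becomes an atomic_t; treat the exact value returned as approximate:

if (sk && inet_sk(sk)->inet_daddr) {
        int val;

        if (sk_is_tcp(sk)) {
                /*
                 * TCP holds the socket lock here: a plain
                 * read-modify-write avoids a locked instruction.
                 */
                sock_owned_by_me(sk);
                val = atomic_read(&inet_sk(sk)->inet_id);
                atomic_set(&inet_sk(sk)->inet_id, val + segs);
        } else {
                /* UDP sendmsg() is lockless: use an atomic RMW. */
                val = atomic_add_return(segs, &inet_sk(sk)->inet_id);
        }
        iph->id = htons(val);
}

Either way each sender reserves a unique range of IDs, which is all the IP identification field needs; only the lockless path pays for the atomic operation.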