aboutsummaryrefslogtreecommitdiff
path: root/include
AgeCommit message (Collapse)AuthorFilesLines
2022-01-15hugetlb: add hugetlb.*.numa_stat fileMina Almasry2-2/+9
For hugetlb backed jobs/VMs it's critical to understand the numa information for the memory backing these jobs to deliver optimal performance. Currently this technically can be queried from /proc/self/numa_maps, but there are significant issues with that. Namely: 1. Memory can be mapped or unmapped. 2. numa_maps are per process and need to be aggregated across all processes in the cgroup. For shared memory this is more involved as the userspace needs to make sure it doesn't double count shared mappings. 3. I believe querying numa_maps needs to hold the mmap_lock which adds to the contention on this lock. For these reasons I propose simply adding hugetlb.*.numa_stat file, which shows the numa information of the cgroup similarly to memory.numa_stat. On cgroup-v2: cat /sys/fs/cgroup/unified/test/hugetlb.2MB.numa_stat total=2097152 N0=2097152 N1=0 On cgroup-v1: cat /sys/fs/cgroup/hugetlb/test/hugetlb.2MB.numa_stat total=2097152 N0=2097152 N1=0 hierarichal_total=2097152 N0=2097152 N1=0 This patch was tested manually by allocating hugetlb memory and querying the hugetlb.*.numa_stat file of the cgroup and its parents. [[email protected]: fix spelling mistake "hierarichal" -> "hierarchical"] Link: https://lkml.kernel.org/r/[email protected] [[email protected]: fix copy/paste array assignment] Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Mina Almasry <[email protected]> Signed-off-by: Colin Ian King <[email protected]> Signed-off-by: Kees Cook <[email protected]> Reviewed-by: Shakeel Butt <[email protected]> Reviewed-by: Muchun Song <[email protected]> Reviewed-by: Mike Kravetz <[email protected]> Cc: Shuah Khan <[email protected]> Cc: Miaohe Lin <[email protected]> Cc: Oscar Salvador <[email protected]> Cc: Michal Hocko <[email protected]> Cc: David Rientjes <[email protected]> Cc: Jue Wang <[email protected]> Cc: Yang Yao <[email protected]> Cc: Joanna Li <[email protected]> Cc: Cannon Matthews <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2022-01-15mm_zone: add function to check if managed dma zone existsBaoquan He1-0/+9
Patch series "Handle warning of allocation failure on DMA zone w/o managed pages", v4. **Problem observed: On x86_64, when crash is triggered and entering into kdump kernel, page allocation failure can always be seen. --------------------------------- DMA: preallocated 128 KiB GFP_KERNEL pool for atomic allocations swapper/0: page allocation failure: order:5, mode:0xcc1(GFP_KERNEL|GFP_DMA), nodemask=(null),cpuset=/,mems_allowed=0 CPU: 0 PID: 1 Comm: swapper/0 Call Trace: dump_stack+0x7f/0xa1 warn_alloc.cold+0x72/0xd6 ...... __alloc_pages+0x24d/0x2c0 ...... dma_atomic_pool_init+0xdb/0x176 do_one_initcall+0x67/0x320 ? rcu_read_lock_sched_held+0x3f/0x80 kernel_init_freeable+0x290/0x2dc ? rest_init+0x24f/0x24f kernel_init+0xa/0x111 ret_from_fork+0x22/0x30 Mem-Info: ------------------------------------ ***Root cause: In the current kernel, it assumes that DMA zone must have managed pages and try to request pages if CONFIG_ZONE_DMA is enabled. While this is not always true. E.g in kdump kernel of x86_64, only low 1M is presented and locked down at very early stage of boot, so that this low 1M won't be added into buddy allocator to become managed pages of DMA zone. This exception will always cause page allocation failure if page is requested from DMA zone. ***Investigation: This failure happens since below commit merged into linus's tree. 1a6a9044b967 x86/setup: Remove CONFIG_X86_RESERVE_LOW and reservelow= options 23721c8e92f7 x86/crash: Remove crash_reserve_low_1M() f1d4d47c5851 x86/setup: Always reserve the first 1M of RAM 7c321eb2b843 x86/kdump: Remove the backup region handling 6f599d84231f x86/kdump: Always reserve the low 1M when the crashkernel option is specified Before them, on x86_64, the low 640K area will be reused by kdump kernel. So in kdump kernel, the content of low 640K area is copied into a backup region for dumping before jumping into kdump. Then except of those firmware reserved region in [0, 640K], the left area will be added into buddy allocator to become available managed pages of DMA zone. However, after above commits applied, in kdump kernel of x86_64, the low 1M is reserved by memblock, but not released to buddy allocator. So any later page allocation requested from DMA zone will fail. At the beginning, if crashkernel is reserved, the low 1M need be locked down because AMD SME encrypts memory making the old backup region mechanims impossible when switching into kdump kernel. Later, it was also observed that there are BIOSes corrupting memory under 1M. To solve this, in commit f1d4d47c5851, the entire region of low 1M is always reserved after the real mode trampoline is allocated. Besides, recently, Intel engineer mentioned their TDX (Trusted domain extensions) which is under development in kernel also needs to lock down the low 1M. So we can't simply revert above commits to fix the page allocation failure from DMA zone as someone suggested. ***Solution: Currently, only DMA atomic pool and dma-kmalloc will initialize and request page allocation with GFP_DMA during bootup. So only initializ DMA atomic pool when DMA zone has available managed pages, otherwise just skip the initialization. For dma-kmalloc(), for the time being, let's mute the warning of allocation failure if requesting pages from DMA zone while no manged pages. Meanwhile, change code to use dma_alloc_xx/dma_map_xx API to replace kmalloc(GFP_DMA), or do not use GFP_DMA when calling kmalloc() if not necessary. Christoph is posting patches to fix those under drivers/scsi/. Finally, we can remove the need of dma-kmalloc() as people suggested. This patch (of 3): In some places of the current kernel, it assumes that dma zone must have managed pages if CONFIG_ZONE_DMA is enabled. While this is not always true. E.g in kdump kernel of x86_64, only low 1M is presented and locked down at very early stage of boot, so that there's no managed pages at all in DMA zone. This exception will always cause page allocation failure if page is requested from DMA zone. Here add function has_managed_dma() and the relevant helper functions to check if there's DMA zone with managed pages. It will be used in later patches. Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Fixes: 6f599d84231f ("x86/kdump: Always reserve the low 1M when the crashkernel option is specified") Signed-off-by: Baoquan He <[email protected]> Reviewed-by: David Hildenbrand <[email protected]> Acked-by: John Donnelly <[email protected]> Cc: Christoph Hellwig <[email protected]> Cc: Christoph Lameter <[email protected]> Cc: Hyeonggon Yoo <[email protected]> Cc: Pekka Enberg <[email protected]> Cc: David Rientjes <[email protected]> Cc: Joonsoo Kim <[email protected]> Cc: Vlastimil Babka <[email protected]> Cc: David Laight <[email protected]> Cc: Borislav Petkov <[email protected]> Cc: Marek Szyprowski <[email protected]> Cc: Robin Murphy <[email protected]> Cc: <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2022-01-15include/linux/gfp.h: further document GFP_DMA32Miles Chen1-1/+3
kmalloc(..., GFP_DMA32) does not return DMA32 memory because the DMA32 kmalloc cache array is not implemented. (Reason: there is no such user in kernel). Put a short comment about this so people can understand this by reading the comment. [1] https://lists.linuxfoundation.org/pipermail/iommu/2018-December/031696.html Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Miles Chen <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2022-01-15mm: drop node from alloc_pages_vmaMichal Hocko1-4/+4
alloc_pages_vma is meant to allocate a page with a vma specific memory policy. The initial node parameter is always a local node so it is pointless to waste a function argument for this. Drop the parameter. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Michal Hocko <[email protected]> Cc: Aneesh Kumar K.V <[email protected]> Cc: Ben Widawsky <[email protected]> Cc: Dave Hansen <[email protected]> Cc: Feng Tang <[email protected]> Cc: Andrea Arcangeli <[email protected]> Cc: Mel Gorman <[email protected]> Cc: Mike Kravetz <[email protected]> Cc: Randy Dunlap <[email protected]> Cc: Vlastimil Babka <[email protected]> Cc: Andi Kleen <[email protected]> Cc: Dan Williams <[email protected]> Cc: "Huang, Ying" <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2022-01-15mm: fix boolreturn.cocci warningChangcheng Deng1-1/+1
Return statements in functions returning bool should use true/false instead of 1/0. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Changcheng Deng <[email protected]> Reported-by: Zeal Robot <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2022-01-15mm: introduce memalloc_retry_wait()NeilBrown1-0/+26
Various places in the kernel - largely in filesystems - respond to a memory allocation failure by looping around and re-trying. Some of these cannot conveniently use __GFP_NOFAIL, for reasons such as: - a GFP_ATOMIC allocation, which __GFP_NOFAIL doesn't work on - a need to check for the process being signalled between failures - the possibility that other recovery actions could be performed - the allocation is quite deep in support code, and passing down an extra flag to say if __GFP_NOFAIL is wanted would be clumsy. Many of these currently use congestion_wait() which (in almost all cases) simply waits the given timeout - congestion isn't tracked for most devices. It isn't clear what the best delay is for loops, but it is clear that the various filesystems shouldn't be responsible for choosing a timeout. This patch introduces memalloc_retry_wait() with takes on that responsibility. Code that wants to retry a memory allocation can call this function passing the GFP flags that were used. It will wait however is appropriate. For now, it only considers __GFP_NORETRY and whatever gfpflags_allow_blocking() tests. If blocking is allowed without __GFP_NORETRY, then alloc_page either made some reclaim progress, or waited for a while, before failing. So there is no need for much further waiting. memalloc_retry_wait() will wait until the current jiffie ends. If this condition is not met, then alloc_page() won't have waited much if at all. In that case memalloc_retry_wait() waits about 200ms. This is the delay that most current loops uses. linux/sched/mm.h needs to be included in some files now, but linux/backing-dev.h does not. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: NeilBrown <[email protected]> Cc: Dave Chinner <[email protected]> Cc: Michal Hocko <[email protected]> Cc: "Theodore Ts'o" <[email protected]> Cc: Jaegeuk Kim <[email protected]> Cc: Chao Yu <[email protected]> Cc: Darrick J. Wong <[email protected]> Cc: Chuck Lever <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2022-01-15mm: allow !GFP_KERNEL allocations for kvmallocMichal Hocko1-1/+0
Support for GFP_NO{FS,IO} and __GFP_NOFAIL has been implemented by previous patches so we can allow the support for kvmalloc. This will allow some external users to simplify or completely remove their helpers. GFP_NOWAIT semantic hasn't been supported so far but it hasn't been explicitly documented so let's add a note about that. ceph_kvmalloc is the first helper to be dropped and changed to kvmalloc. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Michal Hocko <[email protected]> Reviewed-by: Uladzislau Rezki (Sony) <[email protected]> Acked-by: Vlastimil Babka <[email protected]> Cc: Christoph Hellwig <[email protected]> Cc: Dave Chinner <[email protected]> Cc: Ilya Dryomov <[email protected]> Cc: Jeff Layton <[email protected]> Cc: Neil Brown <[email protected]> Cc: Sebastian Andrzej Siewior <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2022-01-15mm: remove the total_mapcount argument from page_trans_huge_mapcount()Matthew Wilcox (Oracle)2-8/+4
All callers pass NULL, so we can stop calculating the value we would store in it. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Matthew Wilcox (Oracle) <[email protected]> Reviewed-by: William Kucharski <[email protected]> Acked-by: Linus Torvalds <[email protected]> Cc: David Hildenbrand <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2022-01-15mm: remove last argument of reuse_swap_page()Matthew Wilcox (Oracle)1-3/+3
None of the callers care about the total_map_swapcount() any more. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Matthew Wilcox (Oracle) <[email protected]> Acked-by: Linus Torvalds <[email protected]> Reviewed-by: William Kucharski <[email protected]> Reviewed-by: David Hildenbrand <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2022-01-15mm: page table checkPasha Tatashin1-0/+147
Check user page table entries at the time they are added and removed. Allows to synchronously catch memory corruption issues related to double mapping. When a pte for an anonymous page is added into page table, we verify that this pte does not already point to a file backed page, and vice versa if this is a file backed page that is being added we verify that this page does not have an anonymous mapping We also enforce that read-only sharing for anonymous pages is allowed (i.e. cow after fork). All other sharing must be for file pages. Page table check allows to protect and debug cases where "struct page" metadata became corrupted for some reason. For example, when refcnt or mapcount become invalid. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Pasha Tatashin <[email protected]> Cc: Aneesh Kumar K.V <[email protected]> Cc: Dave Hansen <[email protected]> Cc: David Rientjes <[email protected]> Cc: Frederic Weisbecker <[email protected]> Cc: Greg Thelen <[email protected]> Cc: "H. Peter Anvin" <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Jiri Slaby <[email protected]> Cc: Jonathan Corbet <[email protected]> Cc: Kees Cook <[email protected]> Cc: Masahiro Yamada <[email protected]> Cc: Mike Rapoport <[email protected]> Cc: Muchun Song <[email protected]> Cc: Paul Turner <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Sami Tolvanen <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Wei Xu <[email protected]> Cc: Will Deacon <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2022-01-15mm: ptep_clear() page table helperPasha Tatashin1-0/+8
We have ptep_get_and_clear() and ptep_get_and_clear_full() helpers to clear PTE from user page tables, but there is no variant for simple clear of a present PTE from user page tables without using a low level pte_clear() which can be either native or para-virtualised. Add a new ptep_clear() that can be used in common code to clear PTEs from page table. We will need this call later in order to add a hook for page table check. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Pasha Tatashin <[email protected]> Cc: Aneesh Kumar K.V <[email protected]> Cc: Dave Hansen <[email protected]> Cc: David Rientjes <[email protected]> Cc: Frederic Weisbecker <[email protected]> Cc: Greg Thelen <[email protected]> Cc: "H. Peter Anvin" <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Jiri Slaby <[email protected]> Cc: Jonathan Corbet <[email protected]> Cc: Kees Cook <[email protected]> Cc: Masahiro Yamada <[email protected]> Cc: Mike Rapoport <[email protected]> Cc: Muchun Song <[email protected]> Cc: Paul Turner <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Sami Tolvanen <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Wei Xu <[email protected]> Cc: Will Deacon <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2022-01-15mm: document locking restrictions for vm_operations_struct::closeSuren Baghdasaryan1-0/+4
Add comments for vm_operations_struct::close documenting locking requirements for this callback and its callers. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Suren Baghdasaryan <[email protected]> Acked-by: Michal Hocko <[email protected]> Cc: Andrea Arcangeli <[email protected]> Cc: Andy Lutomirski <[email protected]> Cc: Christian Brauner <[email protected]> Cc: Christian Brauner <[email protected]> Cc: Christoph Hellwig <[email protected]> Cc: David Hildenbrand <[email protected]> Cc: David Rientjes <[email protected]> Cc: Florian Weimer <[email protected]> Cc: Jan Engelhardt <[email protected]> Cc: Jann Horn <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Kirill A. Shutemov <[email protected]> Cc: Matthew Wilcox <[email protected]> Cc: Minchan Kim <[email protected]> Cc: Oleg Nesterov <[email protected]> Cc: Rik van Riel <[email protected]> Cc: Roman Gushchin <[email protected]> Cc: Shakeel Butt <[email protected]> Cc: Tim Murray <[email protected]> Cc: Jason Gunthorpe <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2022-01-15mm: move tlb_flush_pending inline helpers to mm_inline.hArnd Bergmann3-129/+131
linux/mm_types.h should only define structure definitions, to make it cheap to include elsewhere. The atomic_t helper function definitions are particularly large, so it's better to move the helpers using those into the existing linux/mm_inline.h and only include that where needed. As a follow-up, we may want to go through all the indirect includes in mm_types.h and reduce them as much as possible. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Arnd Bergmann <[email protected]> Cc: Al Viro <[email protected]> Cc: Stephen Rothwell <[email protected]> Cc: Suren Baghdasaryan <[email protected]> Cc: Colin Cross <[email protected]> Cc: Kees Cook <[email protected]> Cc: Peter Xu <[email protected]> Cc: Peter Zijlstra (Intel) <[email protected]> Cc: Yu Zhao <[email protected]> Cc: Vlastimil Babka <[email protected]> Cc: Matthew Wilcox (Oracle) <[email protected]> Cc: Eric Biederman <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2022-01-15mm: move anon_vma declarations to linux/mm_inline.hArnd Bergmann2-48/+50
The patch to add anonymous vma names causes a build failure in some configurations: include/linux/mm_types.h: In function 'is_same_vma_anon_name': include/linux/mm_types.h:924:37: error: implicit declaration of function 'strcmp' [-Werror=implicit-function-declaration] 924 | return name && vma_name && !strcmp(name, vma_name); | ^~~~~~ include/linux/mm_types.h:22:1: note: 'strcmp' is defined in header '<string.h>'; did you forget to '#include <string.h>'? This should not really be part of linux/mm_types.h in the first place, as that header is meant to only contain structure defintions and need a minimum set of indirect includes itself. While the header clearly includes more than it should at this point, let's not make it worse by including string.h as well, which would pull in the expensive (compile-speed wise) fortify-string logic. Move the new functions into a separate header that only needs to be included in a couple of locations. Link: https://lkml.kernel.org/r/[email protected] Fixes: "mm: add a field to store names for private anonymous memory" Signed-off-by: Arnd Bergmann <[email protected]> Cc: Al Viro <[email protected]> Cc: Colin Cross <[email protected]> Cc: Eric Biederman <[email protected]> Cc: Kees Cook <[email protected]> Cc: Matthew Wilcox (Oracle) <[email protected]> Cc: Peter Xu <[email protected]> Cc: Peter Zijlstra (Intel) <[email protected]> Cc: Stephen Rothwell <[email protected]> Cc: Suren Baghdasaryan <[email protected]> Cc: Vlastimil Babka <[email protected]> Cc: Yu Zhao <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2022-01-15mm: add anonymous vma name refcountingSuren Baghdasaryan1-1/+8
While forking a process with high number (64K) of named anonymous vmas the overhead caused by strdup() is noticeable. Experiments with ARM64 Android device show up to 40% performance regression when forking a process with 64k unpopulated anonymous vmas using the max name lengths vs the same process with the same number of anonymous vmas having no name. Introduce anon_vma_name refcounted structure to avoid the overhead of copying vma names during fork() and when splitting named anonymous vmas. When a vma is duplicated, instead of copying the name we increment the refcount of this structure. Multiple vmas can point to the same anon_vma_name as long as they increment the refcount. The name member of anon_vma_name structure is assigned at structure allocation time and is never changed. If vma name changes then the refcount of the original structure is dropped, a new anon_vma_name structure is allocated to hold the new name and the vma pointer is updated to point to the new structure. With this approach the fork() performance regressions is reduced 3-4x times and with usecases using more reasonable number of VMAs (a few thousand) the regressions is not measurable. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Suren Baghdasaryan <[email protected]> Reviewed-by: Kees Cook <[email protected]> Cc: Al Viro <[email protected]> Cc: Colin Cross <[email protected]> Cc: Cyrill Gorcunov <[email protected]> Cc: Dave Hansen <[email protected]> Cc: David Rientjes <[email protected]> Cc: "Eric W. Biederman" <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Jan Glauber <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: John Stultz <[email protected]> Cc: Mel Gorman <[email protected]> Cc: Minchan Kim <[email protected]> Cc: Oleg Nesterov <[email protected]> Cc: Pekka Enberg <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Rob Landley <[email protected]> Cc: "Serge E. Hallyn" <[email protected]> Cc: Shaohua Li <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2022-01-15mm: add a field to store names for private anonymous memoryColin Cross3-5/+75
In many userspace applications, and especially in VM based applications like Android uses heavily, there are multiple different allocators in use. At a minimum there is libc malloc and the stack, and in many cases there are libc malloc, the stack, direct syscalls to mmap anonymous memory, and multiple VM heaps (one for small objects, one for big objects, etc.). Each of these layers usually has its own tools to inspect its usage; malloc by compiling a debug version, the VM through heap inspection tools, and for direct syscalls there is usually no way to track them. On Android we heavily use a set of tools that use an extended version of the logic covered in Documentation/vm/pagemap.txt to walk all pages mapped in userspace and slice their usage by process, shared (COW) vs. unique mappings, backing, etc. This can account for real physical memory usage even in cases like fork without exec (which Android uses heavily to share as many private COW pages as possible between processes), Kernel SamePage Merging, and clean zero pages. It produces a measurement of the pages that only exist in that process (USS, for unique), and a measurement of the physical memory usage of that process with the cost of shared pages being evenly split between processes that share them (PSS). If all anonymous memory is indistinguishable then figuring out the real physical memory usage (PSS) of each heap requires either a pagemap walking tool that can understand the heap debugging of every layer, or for every layer's heap debugging tools to implement the pagemap walking logic, in which case it is hard to get a consistent view of memory across the whole system. Tracking the information in userspace leads to all sorts of problems. It either needs to be stored inside the process, which means every process has to have an API to export its current heap information upon request, or it has to be stored externally in a filesystem that somebody needs to clean up on crashes. It needs to be readable while the process is still running, so it has to have some sort of synchronization with every layer of userspace. Efficiently tracking the ranges requires reimplementing something like the kernel vma trees, and linking to it from every layer of userspace. It requires more memory, more syscalls, more runtime cost, and more complexity to separately track regions that the kernel is already tracking. This patch adds a field to /proc/pid/maps and /proc/pid/smaps to show a userspace-provided name for anonymous vmas. The names of named anonymous vmas are shown in /proc/pid/maps and /proc/pid/smaps as [anon:<name>]. Userspace can set the name for a region of memory by calling prctl(PR_SET_VMA, PR_SET_VMA_ANON_NAME, start, len, (unsigned long)name) Setting the name to NULL clears it. The name length limit is 80 bytes including NUL-terminator and is checked to contain only printable ascii characters (including space), except '[',']','\','$' and '`'. Ascii strings are being used to have a descriptive identifiers for vmas, which can be understood by the users reading /proc/pid/maps or /proc/pid/smaps. Names can be standardized for a given system and they can include some variable parts such as the name of the allocator or a library, tid of the thread using it, etc. The name is stored in a pointer in the shared union in vm_area_struct that points to a null terminated string. Anonymous vmas with the same name (equivalent strings) and are otherwise mergeable will be merged. The name pointers are not shared between vmas even if they contain the same name. The name pointer is stored in a union with fields that are only used on file-backed mappings, so it does not increase memory usage. CONFIG_ANON_VMA_NAME kernel configuration is introduced to enable this feature. It keeps the feature disabled by default to prevent any additional memory overhead and to avoid confusing procfs parsers on systems which are not ready to support named anonymous vmas. The patch is based on the original patch developed by Colin Cross, more specifically on its latest version [1] posted upstream by Sumit Semwal. It used a userspace pointer to store vma names. In that design, name pointers could be shared between vmas. However during the last upstreaming attempt, Kees Cook raised concerns [2] about this approach and suggested to copy the name into kernel memory space, perform validity checks [3] and store as a string referenced from vm_area_struct. One big concern is about fork() performance which would need to strdup anonymous vma names. Dave Hansen suggested experimenting with worst-case scenario of forking a process with 64k vmas having longest possible names [4]. I ran this experiment on an ARM64 Android device and recorded a worst-case regression of almost 40% when forking such a process. This regression is addressed in the followup patch which replaces the pointer to a name with a refcounted structure that allows sharing the name pointer between vmas of the same name. Instead of duplicating the string during fork() or when splitting a vma it increments the refcount. [1] https://lore.kernel.org/linux-mm/[email protected]/ [2] https://lore.kernel.org/linux-mm/202009031031.D32EF57ED@keescook/ [3] https://lore.kernel.org/linux-mm/202009031022.3834F692@keescook/ [4] https://lore.kernel.org/linux-mm/[email protected]/ Changes for prctl(2) manual page (in the options section): PR_SET_VMA Sets an attribute specified in arg2 for virtual memory areas starting from the address specified in arg3 and spanning the size specified in arg4. arg5 specifies the value of the attribute to be set. Note that assigning an attribute to a virtual memory area might prevent it from being merged with adjacent virtual memory areas due to the difference in that attribute's value. Currently, arg2 must be one of: PR_SET_VMA_ANON_NAME Set a name for anonymous virtual memory areas. arg5 should be a pointer to a null-terminated string containing the name. The name length including null byte cannot exceed 80 bytes. If arg5 is NULL, the name of the appropriate anonymous virtual memory areas will be reset. The name can contain only printable ascii characters (including space), except '[',']','\','$' and '`'. This feature is available only if the kernel is built with the CONFIG_ANON_VMA_NAME option enabled. [[email protected]: docs: proc.rst: /proc/PID/maps: fix malformed table] Link: https://lkml.kernel.org/r/[email protected] [surenb: rebased over v5.15-rc6, replaced userpointer with a kernel copy, added input sanitization and CONFIG_ANON_VMA_NAME config. The bulk of the work here was done by Colin Cross, therefore, with his permission, keeping him as the author] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Colin Cross <[email protected]> Signed-off-by: Suren Baghdasaryan <[email protected]> Reviewed-by: Kees Cook <[email protected]> Cc: Stephen Rothwell <[email protected]> Cc: Al Viro <[email protected]> Cc: Cyrill Gorcunov <[email protected]> Cc: Dave Hansen <[email protected]> Cc: David Rientjes <[email protected]> Cc: "Eric W. Biederman" <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Jan Glauber <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: John Stultz <[email protected]> Cc: Mel Gorman <[email protected]> Cc: Minchan Kim <[email protected]> Cc: Oleg Nesterov <[email protected]> Cc: Pekka Enberg <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Rob Landley <[email protected]> Cc: "Serge E. Hallyn" <[email protected]> Cc: Shaohua Li <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2022-01-15memcg: add per-memcg vmalloc statShakeel Butt1-0/+21
The kvmalloc* allocation functions can fallback to vmalloc allocations and more often on long running machines. In addition the kernel does have __GFP_ACCOUNT kvmalloc* calls. So, often on long running machines, the memory.stat does not tell the complete picture which type of memory is charged to the memcg. So add a per-memcg vmalloc stat. [[email protected]: page_memcg() within rcu lock, per Muchun] Link: https://lkml.kernel.org/r/[email protected] [[email protected]: remove cast, per Muchun] [[email protected]: remove area->page[0] checks and move to page by page accounting per Michal] Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Shakeel Butt <[email protected]> Acked-by: Roman Gushchin <[email protected]> Reviewed-by: Muchun Song <[email protected]> Acked-by: Michal Hocko <[email protected]> Cc: Johannes Weiner <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2022-01-15mm/memcg: add oom_group_kill memory eventDan Schatzberg1-0/+1
Our container agent wants to know when a container exits if it was OOM killed or not to report to the user. We use memory.oom.group = 1 to ensure that OOM kills within the container's cgroup kill everything. Existing memory.events are insufficient for knowing if this triggered: 1) Our current approach reads memory.events oom_kill and reports the container was killed if the value is non-zero. This is erroneous in some cases where containers create their children cgroups with memory.oom.group=1 as such OOM kills will get counted against the container cgroup's oom_kill counter despite not actually OOM killing the entire container. 2) Reading memory.events.local will fail to identify OOM kills in leaf cgroups (that don't set memory.oom.group) within the container cgroup. This patch adds a new oom_group_kill event when memory.oom.group triggers to allow userspace to cleanly identify when an entire cgroup is oom killed. [[email protected]: changes from Johannes and Chris] Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Dan Schatzberg <[email protected]> Reviewed-by: Roman Gushchin <[email protected]> Acked-by: Johannes Weiner <[email protected]> Acked-by: Chris Down <[email protected]> Reviewed-by: Shakeel Butt <[email protected]> Acked-by: Michal Hocko <[email protected]> Cc: Tejun Heo <[email protected]> Cc: Zefan Li <[email protected]> Cc: Jonathan Corbet <[email protected]> Cc: Vladimir Davydov <[email protected]> Cc: Matthew Wilcox (Oracle) <[email protected]> Cc: Muchun Song <[email protected]> Cc: Alex Shi <[email protected]> Cc: Wei Yang <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2022-01-15mm,fs: split dump_mapping() out from dump_page()Matthew Wilcox (Oracle)1-0/+1
dump_mapping() is a big chunk of dump_page(), and it'd be handy to be able to call it when we don't have a struct page. Split it out and move it to fs/inode.c. Take the opportunity to simplify some of the debug messages a little. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Matthew Wilcox (Oracle) <[email protected]> Reviewed-by: William Kucharski <[email protected]> Acked-by: Michal Hocko <[email protected]> Cc: Vlastimil Babka <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2022-01-15mm/memremap: add ZONE_DEVICE support for compound pagesJoao Martins1-0/+11
Add a new @vmemmap_shift property for struct dev_pagemap which specifies that a devmap is composed of a set of compound pages of order @vmemmap_shift, instead of base pages. When a compound page devmap is requested, all but the first page are initialised as tail pages instead of order-0 pages. For certain ZONE_DEVICE users like device-dax which have a fixed page size, this creates an opportunity to optimize GUP and GUP-fast walkers, treating it the same way as THP or hugetlb pages. Additionally, commit 7118fc2906e2 ("hugetlb: address ref count racing in prep_compound_gigantic_page") removed set_page_count() because the setting of page ref count to zero was redundant. devmap pages don't come from page allocator though and only head page refcount is used for compound pages, hence initialize tail page count to zero. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Joao Martins <[email protected]> Reviewed-by: Dan Williams <[email protected]> Cc: Christoph Hellwig <[email protected]> Cc: Dave Jiang <[email protected]> Cc: Jane Chu <[email protected]> Cc: Jason Gunthorpe <[email protected]> Cc: Jason Gunthorpe <[email protected]> Cc: John Hubbard <[email protected]> Cc: Jonathan Corbet <[email protected]> Cc: Matthew Wilcox (Oracle) <[email protected]> Cc: Mike Kravetz <[email protected]> Cc: Muchun Song <[email protected]> Cc: Naoya Horiguchi <[email protected]> Cc: Vishal Verma <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2022-01-15mm: defer kmemleak object creation of module_alloc()Kefeng Wang2-2/+9
Yongqiang reports a kmemleak panic when module insmod/rmmod with KASAN enabled(without KASAN_VMALLOC) on x86[1]. When the module area allocates memory, it's kmemleak_object is created successfully, but the KASAN shadow memory of module allocation is not ready, so when kmemleak scan the module's pointer, it will panic due to no shadow memory with KASAN check. module_alloc __vmalloc_node_range kmemleak_vmalloc kmemleak_scan update_checksum kasan_module_alloc kmemleak_ignore Note, there is no problem if KASAN_VMALLOC enabled, the modules area entire shadow memory is preallocated. Thus, the bug only exits on ARCH which supports dynamic allocation of module area per module load, for now, only x86/arm64/s390 are involved. Add a VM_DEFER_KMEMLEAK flags, defer vmalloc'ed object register of kmemleak in module_alloc() to fix this issue. [1] https://lore.kernel.org/all/[email protected]/ [[email protected]: fix build] Link: https://lkml.kernel.org/r/[email protected] [[email protected]: simplify ifdefs, per Andrey] Link: https://lkml.kernel.org/r/CA+fCnZcnwJHUQq34VuRxpdoY6_XbJCDJ-jopksS5Eia4PijPzw@mail.gmail.com Link: https://lkml.kernel.org/r/[email protected] Fixes: 793213a82de4 ("s390/kasan: dynamic shadow mem allocation for modules") Fixes: 39d114ddc682 ("arm64: add KASAN support") Fixes: bebf56a1b176 ("kasan: enable instrumentation of global variables") Signed-off-by: Kefeng Wang <[email protected]> Reported-by: Yongqiang Liu <[email protected]> Cc: Andrey Konovalov <[email protected]> Cc: Andrey Ryabinin <[email protected]> Cc: Dmitry Vyukov <[email protected]> Cc: Catalin Marinas <[email protected]> Cc: Will Deacon <[email protected]> Cc: Heiko Carstens <[email protected]> Cc: Vasily Gorbik <[email protected]> Cc: Christian Borntraeger <[email protected]> Cc: Alexander Gordeev <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Borislav Petkov <[email protected]> Cc: Dave Hansen <[email protected]> Cc: Alexander Potapenko <[email protected]> Cc: Kefeng Wang <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2022-01-15kthread: add the helper function kthread_run_on_cpu()Cai Huoqing1-0/+25
Add a new helper function kthread_run_on_cpu(), which includes kthread_create_on_cpu/wake_up_process(). In some cases, use kthread_run_on_cpu() directly instead of kthread_create_on_node/kthread_bind/wake_up_process() or kthread_create_on_cpu/wake_up_process() or kthreadd_create/kthread_bind/wake_up_process() to simplify the code. [[email protected]: export kthread_create_on_cpu to modules] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Cai Huoqing <[email protected]> Cc: Bernard Metzler <[email protected]> Cc: Cai Huoqing <[email protected]> Cc: Daniel Bristot de Oliveira <[email protected]> Cc: Davidlohr Bueso <[email protected]> Cc: Doug Ledford <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Jason Gunthorpe <[email protected]> Cc: Joel Fernandes (Google) <[email protected]> Cc: Josh Triplett <[email protected]> Cc: Lai Jiangshan <[email protected]> Cc: Mathieu Desnoyers <[email protected]> Cc: "Paul E . McKenney" <[email protected]> Cc: Steven Rostedt <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2022-01-14vdpa: Avoid taking cf_mutex lock on get statusEli Cohen1-1/+0
Avoid the wrapper holding cf_mutex since it is not protecting anything. To avoid confusion and unnecessary overhead incurred by it, remove. Fixes: f489f27bc0ab ("vdpa: Sync calls set/get config/status with cf_mutex") Signed-off-by: Eli Cohen <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Michael S. Tsirkin <[email protected]> Reviewed-by: Si-Wei Liu<[email protected]> Acked-by: Jason Wang <[email protected]>
2022-01-14vdpa: Support reporting max device capabilitiesEli Cohen2-0/+4
Add max_supported_vqs and supported_features fields to struct vdpa_mgmt_dev. Upstream drivers need to feel these values according to the device capabilities. These values are reported back in a netlink message when showing management devices. Examples: $ auxiliary/mlx5_core.sf.1: supported_classes net max_supported_vqs 257 dev_features CSUM GUEST_CSUM MTU HOST_TSO4 HOST_TSO6 STATUS CTRL_VQ MQ \ CTRL_MAC_ADDR VERSION_1 ACCESS_PLATFORM $ vdpa -j mgmtdev show {"mgmtdev":{"auxiliary/mlx5_core.sf.1":{"supported_classes":["net"], \ "max_supported_vqs":257,"dev_features":["CSUM","GUEST_CSUM","MTU", \ "HOST_TSO4","HOST_TSO6","STATUS","CTRL_VQ","MQ","CTRL_MAC_ADDR", \ "VERSION_1","ACCESS_PLATFORM"]}}} $ vdpa -jp mgmtdev show { "mgmtdev": { "auxiliary/mlx5_core.sf.1": { "supported_classes": [ "net" ], "max_supported_vqs": 257, "dev_features": ["CSUM","GUEST_CSUM","MTU","HOST_TSO4", \ "HOST_TSO6","STATUS","CTRL_VQ","MQ", \ "CTRL_MAC_ADDR","VERSION_1","ACCESS_PLATFORM"] } } } Signed-off-by: Eli Cohen <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Michael S. Tsirkin <[email protected]> Reviewed-by: Si-Wei Liu<[email protected]>
2022-01-14vdpa: Add support for returning device configuration informationEli Cohen1-0/+4
Add netlink attribute to store the negotiated features. This can be used by userspace to get the current state of the vdpa instance. Examples: $ vdpa dev config show vdpa-a vdpa-a: mac 00:00:00:00:88:88 link up link_announce false max_vq_pairs 16 mtu 1500 negotiated_features CSUM GUEST_CSUM MTU MAC HOST_TSO4 HOST_TSO6 STATUS \ CTRL_VQ MQ CTRL_MAC_ADDR VERSION_1 ACCESS_PLATFORM $ vdpa -j dev config show vdpa-a {"config":{"vdpa-a":{"mac":"00:00:00:00:88:88","link ":"up","link_announce":false, \ "max_vq_pairs":16,"mtu":1500,"negotiated_features":["CSUM","GUEST_CSUM","MTU","MAC", \ "HOST_TSO4","HOST_TSO6","STATUS","CTRL_VQ","MQ","CTRL_MAC_ADDR","VERSION_1", \ "ACCESS_PLATFORM"]}}} $ vdpa -jp dev config show vdpa-a { "config": { "vdpa-a": { "mac": "00:00:00:00:88:88", "link ": "up", "link_announce ": false, "max_vq_pairs": 16, "mtu": 1500, "negotiated_features": [ "CSUM","GUEST_CSUM","MTU","MAC","HOST_TSO4","HOST_TSO6","STATUS","CTRL_VQ","MQ", \ "CTRL_MAC_ADDR","VERSION_1","ACCESS_PLATFORM" ] } } } Signed-off-by: Eli Cohen <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Michael S. Tsirkin <[email protected]> Acked-by: Jason Wang <[email protected]>
2022-01-14vdpa: Allow to configure max data virtqueuesEli Cohen1-3/+16
Add netlink support to configure the max virtqueue pairs for a device. At least one pair is required. The maximum is dictated by the device. Example: $ vdpa dev add name vdpa-a mgmtdev auxiliary/mlx5_core.sf.1 max_vqp 4 Signed-off-by: Eli Cohen <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Michael S. Tsirkin <[email protected]>
2022-01-14vdpa: Sync calls set/get config/status with cf_mutexEli Cohen1-0/+3
Add wrappers to get/set status and protect these operations with cf_mutex to serialize these operations with respect to get/set config operations. Signed-off-by: Eli Cohen <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Michael S. Tsirkin <[email protected]>
2022-01-14vdpa: Provide interface to read driver featuresEli Cohen1-5/+9
Provide an interface to read the negotiated features. This is needed when building the netlink message in vdpa_dev_net_config_fill(). Also fix the implementation of vdpa_dev_net_config_fill() to use the negotiated features instead of the device features. To make APIs clearer, make the following name changes to struct vdpa_config_ops so they better describe their operations: get_features -> get_device_features set_features -> set_driver_features Finally, add get_driver_features to return the negotiated features and add implementation to all the upstream drivers. Acked-by: Jason Wang <[email protected]> Signed-off-by: Eli Cohen <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Michael S. Tsirkin <[email protected]>
2022-01-14vdpa: Mark vdpa_config_ops.get_vq_notification as optionalEugenio Pérez1-1/+1
Since vhost_vdpa_mmap checks for its existence before calling it. Signed-off-by: Eugenio Pérez <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Michael S. Tsirkin <[email protected]> Acked-by: Jason Wang <[email protected]> Reviewed-by: Stefano Garzarella <[email protected]>
2022-01-14vdpa: add driver_override supportStefano Garzarella1-0/+2
`driver_override` allows to control which of the vDPA bus drivers binds to a vDPA device. If `driver_override` is not set, the previous behaviour is followed: devices use the first vDPA bus driver loaded (unless auto binding is disabled). Tested on Fedora 34 with driverctl(8): $ modprobe virtio-vdpa $ modprobe vhost-vdpa $ modprobe vdpa-sim-net $ vdpa dev add mgmtdev vdpasim_net name dev1 # dev1 is attached to the first vDPA bus driver loaded $ driverctl -b vdpa list-devices dev1 virtio_vdpa $ driverctl -b vdpa set-override dev1 vhost_vdpa $ driverctl -b vdpa list-devices dev1 vhost_vdpa [*] Note: driverctl(8) integrates with udev so the binding is preserved. Suggested-by: Jason Wang <[email protected]> Acked-by: Jason Wang <[email protected]> Signed-off-by: Stefano Garzarella <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Michael S. Tsirkin <[email protected]>
2022-01-14virtio: wrap config->reset callsMichael S. Tsirkin1-0/+1
This will enable cleanups down the road. The idea is to disable cbs, then add "flush_queued_cbs" callback as a parameter, this way drivers can flush any work queued after callbacks have been disabled. Signed-off-by: Michael S. Tsirkin <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Michael S. Tsirkin <[email protected]>
2022-01-14kvm: x86: Add support for getting/setting expanded xstate bufferGuang Zeng1-0/+4
With KVM_CAP_XSAVE, userspace uses a hardcoded 4KB buffer to get/set xstate data from/to KVM. This doesn't work when dynamic xfeatures (e.g. AMX) are exposed to the guest as they require a larger buffer size. Introduce a new capability (KVM_CAP_XSAVE2). Userspace VMM gets the required xstate buffer size via KVM_CHECK_EXTENSION(KVM_CAP_XSAVE2). KVM_SET_XSAVE is extended to work with both legacy and new capabilities by doing properly-sized memdup_user() based on the guest fpu container. KVM_GET_XSAVE is kept for backward-compatible reason. Instead, KVM_GET_XSAVE2 is introduced under KVM_CAP_XSAVE2 as the preferred interface for getting xstate buffer (4KB or larger size) from KVM (Link: https://lkml.org/lkml/2021/12/15/510) Also, update the api doc with the new KVM_GET_XSAVE2 ioctl. Signed-off-by: Guang Zeng <[email protected]> Signed-off-by: Wei Wang <[email protected]> Signed-off-by: Jing Liu <[email protected]> Signed-off-by: Kevin Tian <[email protected]> Signed-off-by: Yang Zhong <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2022-01-14Merge tag 'char-misc-5.17-rc1' of ↵Linus Torvalds33-225/+3898
git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc Pull char/misc and other driver updates from Greg KH: "Here is the large set of char, misc, and other "small" driver subsystem changes for 5.17-rc1. Lots of different things are in here for char/misc drivers such as: - habanalabs driver updates - mei driver updates - lkdtm driver updates - vmw_vmci driver updates - android binder driver updates - other small char/misc driver updates Also smaller driver subsystems have also been updated, including: - fpga subsystem updates - iio subsystem updates - soundwire subsystem updates - extcon subsystem updates - gnss subsystem updates - phy subsystem updates - coresight subsystem updates - firmware subsystem updates - comedi subsystem updates - mhi subsystem updates - speakup subsystem updates - rapidio subsystem updates - spmi subsystem updates - virtual driver updates - counter subsystem updates Too many individual changes to summarize, the shortlog contains the full details. All of these have been in linux-next for a while with no reported issues" * tag 'char-misc-5.17-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc: (406 commits) counter: 104-quad-8: Fix use-after-free by quad8_irq_handler dt-bindings: mux: Document mux-states property dt-bindings: ti-serdes-mux: Add defines for J721S2 SoC counter: remove old and now unused registration API counter: ti-eqep: Convert to new counter registration counter: stm32-lptimer-cnt: Convert to new counter registration counter: stm32-timer-cnt: Convert to new counter registration counter: microchip-tcb-capture: Convert to new counter registration counter: ftm-quaddec: Convert to new counter registration counter: intel-qep: Convert to new counter registration counter: interrupt-cnt: Convert to new counter registration counter: 104-quad-8: Convert to new counter registration counter: Update documentation for new counter registration functions counter: Provide alternative counter registration functions counter: stm32-timer-cnt: Convert to counter_priv() wrapper counter: stm32-lptimer-cnt: Convert to counter_priv() wrapper counter: ti-eqep: Convert to counter_priv() wrapper counter: ftm-quaddec: Convert to counter_priv() wrapper counter: intel-qep: Convert to counter_priv() wrapper counter: microchip-tcb-capture: Convert to counter_priv() wrapper ...
2022-01-14Merge tag 'powerpc-5.17-1' of ↵Linus Torvalds3-3/+6
git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux Pull powerpc updates from Michael Ellerman: - Optimise radix KVM guest entry/exit by 2x on Power9/Power10. - Allow firmware to tell us whether to disable the entry and uaccess flushes on Power10 or later CPUs. - Add BPF_PROBE_MEM support for 32 and 64-bit BPF jits. - Several fixes and improvements to our hard lockup watchdog. - Activate HAVE_DYNAMIC_FTRACE_WITH_REGS on 32-bit. - Allow building the 64-bit Book3S kernel without hash MMU support, ie. Radix only. - Add KUAP (SMAP) support for 40x, 44x, 8xx, Book3E (64-bit). - Add new encodings for perf_mem_data_src.mem_hops field, and use them on Power10. - A series of small performance improvements to 64-bit interrupt entry. - Several commits fixing issues when building with the clang integrated assembler. - Many other small features and fixes. Thanks to Alan Modra, Alexey Kardashevskiy, Ammar Faizi, Anders Roxell, Arnd Bergmann, Athira Rajeev, Cédric Le Goater, Christophe JAILLET, Christophe Leroy, Christoph Hellwig, Daniel Axtens, David Yang, Erhard Furtner, Fabiano Rosas, Greg Kroah-Hartman, Guo Ren, Hari Bathini, Jason Wang, Joel Stanley, Julia Lawall, Kajol Jain, Kees Cook, Laurent Dufour, Madhavan Srinivasan, Mark Brown, Minghao Chi, Nageswara R Sastry, Naresh Kamboju, Nathan Chancellor, Nathan Lynch, Nicholas Piggin, Nick Child, Oliver O'Halloran, Peiwei Hu, Randy Dunlap, Ravi Bangoria, Rob Herring, Russell Currey, Sachin Sant, Sean Christopherson, Segher Boessenkool, Thadeu Lima de Souza Cascardo, Tyrel Datwyler, Xiang wangx, and Yang Guang. * tag 'powerpc-5.17-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux: (240 commits) powerpc/xmon: Dump XIVE information for online-only processors. powerpc/opal: use default_groups in kobj_type powerpc/cacheinfo: use default_groups in kobj_type powerpc/sched: Remove unused TASK_SIZE_OF powerpc/xive: Add missing null check after calling kmalloc powerpc/floppy: Remove usage of the deprecated "pci-dma-compat.h" API selftests/powerpc: Add a test of sigreturning to an unaligned address powerpc/64s: Use EMIT_WARN_ENTRY for SRR debug warnings powerpc/64s: Mask NIP before checking against SRR0 powerpc/perf: Fix spelling of "its" powerpc/32: Fix boot failure with GCC latent entropy plugin powerpc/code-patching: Replace patch_instruction() by ppc_inst_write() in selftests powerpc/code-patching: Move code patching selftests in its own file powerpc/code-patching: Move instr_is_branch_{i/b}form() in code-patching.h powerpc/code-patching: Move patch_exception() outside code-patching.c powerpc/code-patching: Use test_trampoline for prefixed patch test powerpc/code-patching: Fix patch_branch() return on out-of-range failure powerpc/code-patching: Reorganise do_patch_instruction() to ease error handling powerpc/code-patching: Fix unmap_patch_area() error handling powerpc/code-patching: Fix error handling in do_patch_instruction() ...
2022-01-14Merge tag 'sound-5.17-rc1' of ↵Linus Torvalds29-72/+1010
git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound Pull sound updates from Takashi Iwai: "It's a relatively calm development cycle, but still lots of updates in the driver side like Intel SOF. Below are some highlights: ALSA / ASoC core: - A new kselftest for ALSA control API - PCM NO_REWINDS support - Potential race fixes around control removals - Unify x86 SG-buffer memory allocation code - Cleanups and race fixes for ASoC DPCM locking ASoC: - Refinements and cleanups around the delay() APIs - Wider use of dev_err_probe(). - Continuing cleanups and improvements to the SOF code - Support for pin switches in simple-card derived cards - Support for AMD Renoir ACP, Asahi Kasei Microdevices AKM4375, Intel systems using NAU8825 and MAX98390, Mediatek MT8915, nVidia Tegra20 S/PDIF, Qualcomm systems using ALC5682I-VS and Texas Instruments TLV320ADC3xxx HD-audio / USB-audio: - Fix deadlock at HD-audio codec unbinding - Fixes for Tegra194 HD-audio, new HDA support for CS35L41 codec - Quirks for Lenovo and HP machines, Gigabyte mobo, Bose device Misc: - Fix virmidi drain behavior Note that the merge of CS35L41 codec support is still half-baked, and at least one ACPI change is missing. Although this won't hinder the kernel build itself, we're going to catch up before RC1" * tag 'sound-5.17-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound: (415 commits) ALSA: hda: intel-dsp-config: reorder the config table ALSA: hda: intel-dsp-config: add JasperLake support ALSA: hda: cs35l41: fix double free on error in probe() ALSA: hda: Fix dependencies of CS35L41 on SPI/I2C buses ALSA: hda: Fix dependency on ASoC cs35l41 codec ASoC: cs35l41: Add support for hibernate memory retention mode ASoC: cs35l41: Update handling of test key registers ALSA: intel_hdmi: Check for error num after setting mask ASoC: wcd9335: Keep a RX port value for each SLIM RX mux ASoC: amd: acp: acp-mach: Change default RT1019 amp dev id ALSA: virmidi: Remove duplicated code ALSA: seq: virmidi: Add a drain operation ASoC: topology: Fix typo ASoC: fsl_asrc: refine the check of available clock divider ASoC: Intel: bytcr_rt5640: Add support for external GPIO jack-detect ASoC: Intel: bytcr_rt5640: Support retrieving the codec IRQ from the AMCR0F28 ACPI dev ASoC: rt5640: Add support for boards with an external jack-detect GPIO ASoC: rt5640: Allow snd_soc_component_set_jack() to override the codec IRQ ASoC: rt5640: Change jack_work to a delayed_work ASoC: rt5640: Fix possible NULL pointer deref on resume ...
2022-01-14Merge tag 'scsi-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsiLinus Torvalds3-7/+3
Pull SCSI updates from James Bottomley: "This series consists of the usual driver updates (ufs, pm80xx, lpfc, mpi3mr, mpt3sas, hisi_sas, libsas) and minor updates and bug fixes. The most impactful change is likely the switch from GFP_DMA to GFP_KERNEL in a bunch of drivers, but even that shouldn't affect too many people" * tag 'scsi-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi: (121 commits) scsi: mpi3mr: Bump driver version to 8.0.0.61.0 scsi: mpi3mr: Fixes around reply request queues scsi: mpi3mr: Enhanced Task Management Support Reply handling scsi: mpi3mr: Use TM response codes from MPI3 headers scsi: mpi3mr: Add io_uring interface support in I/O-polled mode scsi: mpi3mr: Print cable mngnt and temp threshold events scsi: mpi3mr: Support Prepare for Reset event scsi: mpi3mr: Add Event acknowledgment logic scsi: mpi3mr: Gracefully handle online FW update operation scsi: mpi3mr: Detect async reset that occurred in firmware scsi: mpi3mr: Add IOC reinit function scsi: mpi3mr: Handle offline FW activation in graceful manner scsi: mpi3mr: Code refactor of IOC init - part2 scsi: mpi3mr: Code refactor of IOC init - part1 scsi: mpi3mr: Fault IOC when internal command gets timeout scsi: mpi3mr: Display IOC firmware package version scsi: mpi3mr: Handle unaligned PLL in unmap cmnds scsi: mpi3mr: Increase internal cmnds timeout to 60s scsi: mpi3mr: Do access status validation before adding devices scsi: mpi3mr: Add support for PCIe Managed Switch SES device ...
2022-01-14ata: libata: Rename link flag ATA_LFLAG_NO_DB_DELAYPaul Menzel1-1/+1
Rename the link flag ATA_LFLAG_NO_DB_DELAY to ATA_LFLAG_NO_DEBOUNCE_DELAY. The new name is longer, but clearer. Signed-off-by: Paul Menzel <[email protected]> Signed-off-by: Damien Le Moal <[email protected]>
2022-01-14ata: fix read_id() ata port operation interfaceDamien Le Moal1-2/+3
Drivers that need to tweak a device IDENTIFY data implement the read_id() port operation. The IDENTIFY data buffer is passed as an argument to the read_id() operation for drivers to use. However, when this operation is called, the IDENTIFY data is not yet converted to CPU endian and contains le16 words. Change the interface of the read_id operation to pass a __le16 * pointer to the IDENTIFY data buffer to clarify the buffer endianness. Fix the pata_netcell, pata_it821x, ahci_xgene, ahci_ceva and ahci_brcm drivers implementation of this operation and modify the code to corretly deal with identify data words manipulation to avoid sparse warnings such as: drivers/ata/ahci_xgene.c:262:33: warning: invalid assignment: &= drivers/ata/ahci_xgene.c:262:33: left side has type unsigned short drivers/ata/ahci_xgene.c:262:33: right side has type restricted __le16 Signed-off-by: Damien Le Moal <[email protected]> Reviewed-by: Hannes Reinecke <[email protected]>
2022-01-14ata: sata_fsl: fix scsi host initializationDamien Le Moal1-0/+11
When compiling with W=1, the sata_fsl driver compilation throws the warning: drivers/ata/sata_fsl.c:1385:22: error: initialized field overwritten [-Werror=override-init] 1385 | .can_queue = SATA_FSL_QUEUE_DEPTH, This is due to the driver scsi host template initialization overwriting the can_queue field that is already set using the ATA_NCQ_SHT() initializer macro, resulting in the same field being initialized twice in the host template declaration. To remove this warning, introduce the ATA_SUBBASE_SHT_QD() and ATA_NCQ_SHT_QD() initialization macros to allow specifying a queue depth different from the default ATA_DEF_QUEUE using an additional argument to the macro. Signed-off-by: Damien Le Moal <[email protected]> Reviewed-by: Hannes Reinecke <[email protected]>
2022-01-13tracing: Account bottom half disabled sections.Sebastian Andrzej Siewior1-0/+1
Disabling only bottom halves via local_bh_disable() disables also preemption but this remains invisible to tracing. On a CONFIG_PREEMPT kernel one might wonder why there is no scheduling happening despite the N flag in the trace. The reason might be the a rcu_read_lock_bh() section. Add a 'b' to the tracing output if in task context with disabled bottom halves. Link: https://lkml.kernel.org/r/YbcbtdtC/[email protected] Signed-off-by: Sebastian Andrzej Siewior <[email protected]> Signed-off-by: Steven Rostedt <[email protected]>
2022-01-13Merge tag 'mailbox-v5.17' of ↵Linus Torvalds1-0/+2
git://git.linaro.org/landing-teams/working/fujitsu/integration Pull mailbox updates from Jassi Brar: - qcom: misc updates to qcom-ipcc driver - mpfs: change compatible string - pcc: - fix handling of subtypes - avoid uninitialized variable - mtk: - add missing of_node_put - enable control_by_sw - silent probe-defer prints - fix gce_num for mt8192 - zynq: add missing of_node_put - imx: check for NULL instead of IS_ERR - appple: switch to generic compatibles - hi3660: convert comments to kernel-doc notation * tag 'mailbox-v5.17' of git://git.linaro.org/landing-teams/working/fujitsu/integration: dt-bindings: mailbox: Add more protocol and client ID mailbox: qcom-ipcc: Support interrupt wake up from suspend mailbox: qcom-ipcc: Support more IPCC instance mailbox: qcom-ipcc: Dynamic alloc for channel arrangement mailbox: change mailbox-mpfs compatible string mailbox: pcc: Handle all PCC subtypes correctly in pcc_mbox_irq mailbox: pcc: Avoid using the uninitialized variable 'dev' mailbox: mtk: add missing of_node_put before return mailbox: zynq: add missing of_node_put before return mailbox: imx: Fix an IS_ERR() vs NULL bug mailbox: hi3660: convert struct comments to kernel-doc notation mailbox: add control_by_sw for mt8195 mailbox: mtk-cmdq: Silent EPROBE_DEFER errors for clks mailbox: fix gce_num of mt8192 driver data mailbox: apple: Bind to generic compatibles dt-bindings: mailbox: apple,mailbox: Add generic and t6000 compatibles
2022-01-13Merge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdmaLinus Torvalds4-1/+31
Pull rdma updates from Jason Gunthorpe: "Another small cycle. Mostly cleanups and bug fixes, quite a bit assisted from bots. There are a few new syzkaller splats that haven't been solved yet but they should get into the rcs in a few weeks, I think. Summary: - Update drivers to use common helpers for GUIDs, pkeys, bitmaps, memset_startat, and others - General code cleanups from bots - Simplify some of the rxe pool code in preparation for a larger rework - Clean out old stuff from hns, including all support for hip06 devices - Fix a bug where GID table entries could be missed if the table had holes in it - Rename paths and sessions in rtrs for better understandability - Consolidate the roce source port selection code - NDR speed support in mlx5" * tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma: (83 commits) RDMA/irdma: Remove the redundant return RDMA/rxe: Use the standard method to produce udp source port RDMA/irdma: Make the source udp port vary RDMA/hns: Replace get_udp_sport with rdma_get_udp_sport RDMA/core: Calculate UDP source port based on flow label or lqpn/rqpn IB/qib: Fix typos RDMA/rtrs-clt: Rename rtrs_clt to rtrs_clt_sess RDMA/rtrs-srv: Rename rtrs_srv to rtrs_srv_sess RDMA/rtrs-clt: Rename rtrs_clt_sess to rtrs_clt_path RDMA/rtrs-srv: Rename rtrs_srv_sess to rtrs_srv_path RDMA/rtrs: Rename rtrs_sess to rtrs_path RDMA/hns: Modify the hop num of HIP09 EQ to 1 IB/iser: Align coding style across driver IB/iser: Remove un-needed casting to/from void pointer IB/iser: Don't suppress send completions IB/iser: Rename ib_ret local variable IB/iser: Fix RNR errors IB/iser: Remove deprecated pi_guard module param IB/mlx5: Expose NDR speed through MAD RDMA/cxgb4: Set queue pair state when being queried ...
2022-01-13Merge tag 'v5.16' into rdma.git for-nextJason Gunthorpe25-35/+108
To resolve minor conflict in: drivers/infiniband/hw/mlx5/mlx5_ib.h By merging both hunks. Signed-off-by: Jason Gunthorpe <[email protected]>
2022-01-13Merge tag 'irq-msi-2022-01-13' of ↵Linus Torvalds5-153/+179
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull MSI irq updates from Thomas Gleixner: "Rework of the MSI interrupt infrastructure. This is a treewide cleanup and consolidation of MSI interrupt handling in preparation for further changes in this area which are necessary to: - address existing shortcomings in the VFIO area - support the upcoming Interrupt Message Store functionality which decouples the message store from the PCI config/MMIO space" * tag 'irq-msi-2022-01-13' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (94 commits) genirq/msi: Populate sysfs entry only once PCI/MSI: Unbreak pci_irq_get_affinity() genirq/msi: Convert storage to xarray genirq/msi: Simplify sysfs handling genirq/msi: Add abuse prevention comment to msi header genirq/msi: Mop up old interfaces genirq/msi: Convert to new functions genirq/msi: Make interrupt allocation less convoluted platform-msi: Simplify platform device MSI code platform-msi: Let core code handle MSI descriptors bus: fsl-mc-msi: Simplify MSI descriptor handling soc: ti: ti_sci_inta_msi: Remove ti_sci_inta_msi_domain_free_irqs() soc: ti: ti_sci_inta_msi: Rework MSI descriptor allocation NTB/msi: Convert to msi_on_each_desc() PCI: hv: Rework MSI handling powerpc/mpic_u3msi: Use msi_for_each-desc() powerpc/fsl_msi: Use msi_for_each_desc() powerpc/pasemi/msi: Convert to msi_on_each_dec() powerpc/cell/axon_msi: Convert to msi_on_each_desc() powerpc/4xx/hsta: Rework MSI handling ...
2022-01-13Merge tag 'timers-core-2022-01-13' of ↵Linus Torvalds1-0/+20
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull timer updates from Thomas Gleixner: "Updates for the time(r) subsystem: Core: - Make the clocksource watchdog more robust by better validation checks of the measurement. Drivers: - New drivers for MStar and SSD20xd SOCs - The usual cleanups and improvements all over the place" * tag 'timers-core-2022-01-13' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: dt-bindings: timer: Add Mstar MSC313e timer devicetree bindings documentation clocksource/drivers/msc313e: Add support for ssd20xd-based platforms clocksource/drivers: Add MStar MSC313e timer support clocksource/drivers/pistachio: Fix -Wunused-but-set-variable warning clocksource/drivers/timer-imx-sysctr: Set cpumask to cpu_possible_mask clocksource/drivers/imx-sysctr: Mark two variable with __ro_after_init clocksource/drivers/renesas,ostm: Make RENESAS_OSTM symbol visible clocksource/drivers/renesas-ostm: Add RZ/G2L OSTM support dt-bindings: timer: renesas: ostm: Document Renesas RZ/G2L OSTM clocksource/drivers/exynos_mct: Fix silly typo resulting in checkpatch warning clocksource: Reduce the default clocksource_watchdog() retries to 2 clocksource: Avoid accidental unstable marking of clocksources dt-bindings: timer: tpm-timer: Add imx8ulp compatible string reset: Add of_reset_control_get_optional_exclusive() clocksource/drivers/exynos_mct: Refactor resources allocation dt-bindings: timer: remove rockchip,rk3066-timer compatible string from rockchip,rk-timer.yaml dt-bindings: timer: cadence_ttc: Add power-domains
2022-01-13Merge tag 'irq-core-2022-01-13' of ↵Linus Torvalds3-3/+56
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull irq updates from Thomas Gleixner: "Updates for the interrupt subsystem: Core: - Provide a new interface for affinity hints to provide a separation between hint and actual affinity change which has become a hidden property of the current interface - Fix up the in tree usage of the affinity hint interfaces Drivers: - No new irqchip drivers! - Fix GICv3 redistributor table reservation with RT across kexec - Fix GICv4.1 redistributor view of the VPE table across kexec - Add support for extra interrupts on spear-shirq - Make obtaining some interrupts optional for the Renesas drivers - Various cleanups and bug fixes" * tag 'irq-core-2022-01-13' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (25 commits) irqchip/renesas-intc-irqpin: Use platform_get_irq_optional() to get the interrupt irqchip/renesas-irqc: Use platform_get_irq_optional() to get the interrupt irqchip/gic-v4: Disable redistributors' view of the VPE table at boot time irqchip/ingenic-tcu: Use correctly sized arguments for bit field irqchip/gic-v2m: Add const to of_device_id irqchip/imx-gpcv2: Mark imx_gpcv2_instance with __ro_after_init irqchip/spear-shirq: Add support for IRQ 0..6 irqchip/gic-v3-its: Limit memreserve cpuhp state lifetime irqchip/gic-v3-its: Postpone LPI pending table freeing and memreserve irqchip/gic-v3-its: Give the percpu rdist struct its own flags field net/mlx4: Use irq_update_affinity_hint() net/mlx5: Use irq_set_affinity_and_hint() hinic: Use irq_set_affinity_and_hint() scsi: lpfc: Use irq_set_affinity() mailbox: Use irq_update_affinity_hint() ixgbe: Use irq_update_affinity_hint() be2net: Use irq_update_affinity_hint() enic: Use irq_update_affinity_hint() RDMA/irdma: Use irq_update_affinity_hint() scsi: mpt3sas: Use irq_set_affinity_and_hint() ...
2022-01-13Merge branch 'pci/errors'Bjorn Helgaas1-0/+9
- Add PCI_ERROR_RESPONSE and related definitions for signaling and checking for transaction errors on PCI (Naveen Naidu) - Fabricate PCI_ERROR_RESPONSE data (~0) in config read wrappers, instead of in host controller drivers, when transactions fail on PCI (Naveen Naidu) - Use PCI_POSSIBLE_ERROR() to check for possible failure of config reads (Naveen Naidu) * pci/errors: PCI: xgene: Use PCI_ERROR_RESPONSE to identify config read errors PCI: hv: Use PCI_ERROR_RESPONSE to identify config read errors PCI: keystone: Use PCI_ERROR_RESPONSE to identify config read errors PCI: Use PCI_ERROR_RESPONSE to identify config read errors PCI: cpqphp: Use PCI_POSSIBLE_ERROR() to check config reads PCI/PME: Use PCI_POSSIBLE_ERROR() to check config reads PCI/DPC: Use PCI_POSSIBLE_ERROR() to check config reads PCI: pciehp: Use PCI_POSSIBLE_ERROR() to check config reads PCI: vmd: Use PCI_POSSIBLE_ERROR() to check config reads PCI/ERR: Use PCI_POSSIBLE_ERROR() to check config reads PCI: rockchip-host: Drop error data fabrication when config read fails PCI: rcar-host: Drop error data fabrication when config read fails PCI: altera: Drop error data fabrication when config read fails PCI: mvebu: Drop error data fabrication when config read fails PCI: aardvark: Drop error data fabrication when config read fails PCI: kirin: Drop error data fabrication when config read fails PCI: histb: Drop error data fabrication when config read fails PCI: exynos: Drop error data fabrication when config read fails PCI: mediatek: Drop error data fabrication when config read fails PCI: iproc: Drop error data fabrication when config read fails PCI: thunder: Drop error data fabrication when config read fails PCI: Drop error data fabrication when config read fails PCI: Use PCI_SET_ERROR_RESPONSE() for disconnected devices PCI: Set error response data when config read fails PCI: Add PCI_ERROR_RESPONSE and related definitions
2022-01-13Merge branch 'pci/misc'Bjorn Helgaas2-94/+94
- Sort Intel Device IDs by value (Andy Shevchenko) - Change Capability offsets to hex to match spec (Baruch Siach) - Correct misspellings (Krzysztof Wilczyński) - Terminate statement with semicolon in pci_endpoint_test.c (Ming Wang) * pci/misc: misc: pci_endpoint_test: Terminate statement with semicolon PCI: Correct misspelled words PCI: Change capability register offsets to hex PCI: Sort Intel Device IDs by value
2022-01-13Merge branch 'pci/host/hv'Bjorn Helgaas1-33/+0
- Add hv-internal interfaces to encapsulate arch IRQ dependencies (Sunil Muthuswamy) - Add arm64 Hyper-V vPCI support (Sunil Muthuswamy) * pci/host/hv: PCI: hv: Add arm64 Hyper-V vPCI support PCI: hv: Make the code arch neutral by adding arch specific interfaces
2022-01-13Merge branch 'pci/resource'Bjorn Helgaas1-0/+1
- Always write Intel I210 ROM BAR on update to work around device defect (Bjorn Helgaas) * pci/resource: PCI: Work around Intel I210 ROM BAR overlap defect