aboutsummaryrefslogtreecommitdiff
path: root/include
AgeCommit message (Collapse)AuthorFilesLines
2022-07-17hugetlb: do not update address in huge_pmd_unshareMike Kravetz1-2/+2
As an optimization for loops sequentially processing hugetlb address ranges, huge_pmd_unshare would update a passed address if it unshared a pmd. Updating a loop control variable outside the loop like this is generally a bad idea. These loops are now using hugetlb_mask_last_page to optimize scanning when non-present ptes are discovered. The same can be done when huge_pmd_unshare returns 1 indicating a pmd was unshared. Remove address update from huge_pmd_unshare. Change the passed argument type and update all callers. In loops sequentially processing addresses use hugetlb_mask_last_page to update address if pmd is unshared. [[email protected]: fix an unused variable warning/error] Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Mike Kravetz <[email protected]> Signed-off-by: Stephen Rothwell <[email protected]> Acked-by: Muchun Song <[email protected]> Reviewed-by: Baolin Wang <[email protected]> Cc: "Aneesh Kumar K.V" <[email protected]> Cc: Anshuman Khandual <[email protected]> Cc: Catalin Marinas <[email protected]> Cc: Christian Borntraeger <[email protected]> Cc: David Hildenbrand <[email protected]> Cc: James Houghton <[email protected]> Cc: kernel test robot <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Mina Almasry <[email protected]> Cc: Naoya Horiguchi <[email protected]> Cc: Paul Walmsley <[email protected]> Cc: Peter Xu <[email protected]> Cc: Rolf Eike Beer <[email protected]> Cc: Will Deacon <[email protected]> Cc: Stephen Rothwell <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2022-07-17hugetlb: skip to end of PT page mapping when pte not presentMike Kravetz1-0/+1
Patch series "hugetlb: speed up linear address scanning", v2. At unmap, fork and remap time hugetlb address ranges are linearly scanned. We can optimize these scans if the ranges are sparsely populated. Also, enable page table "Lazy copy" for hugetlb at fork. NOTE: Architectures not defining CONFIG_ARCH_WANT_GENERAL_HUGETLB need to add an arch specific version hugetlb_mask_last_page() to take advantage of sparse address scanning improvements. Baolin Wang added the routine for arm64. Other architectures which could be optimized are: ia64, mips, parisc, powerpc, s390, sh and sparc. This patch (of 4): HugeTLB address ranges are linearly scanned during fork, unmap and remap operations. If a non-present entry is encountered, the code currently continues to the next huge page aligned address. However, a non-present entry implies that the page table page for that entry is not present. Therefore, the linear scan can skip to the end of range mapped by the page table page. This can speed operations on large sparsely populated hugetlb mappings. Create a new routine hugetlb_mask_last_page() that will return an address mask. When the mask is ORed with an address, the result will be the address of the last huge page mapped by the associated page table page. Use this mask to update addresses in routines which linearly scan hugetlb address ranges when a non-present pte is encountered. hugetlb_mask_last_page is related to the implementation of huge_pte_offset as hugetlb_mask_last_page is called when huge_pte_offset returns NULL. This patch only provides a complete hugetlb_mask_last_page implementation when CONFIG_ARCH_WANT_GENERAL_HUGETLB is defined. Architectures which provide their own versions of huge_pte_offset can also provide their own version of hugetlb_mask_last_page. Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Mike Kravetz <[email protected]> Tested-by: Baolin Wang <[email protected]> Reviewed-by: Baolin Wang <[email protected]> Acked-by: Muchun Song <[email protected]> Reported-by: kernel test robot <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Peter Xu <[email protected]> Cc: Naoya Horiguchi <[email protected]> Cc: James Houghton <[email protected]> Cc: Mina Almasry <[email protected]> Cc: "Aneesh Kumar K.V" <[email protected]> Cc: Anshuman Khandual <[email protected]> Cc: Paul Walmsley <[email protected]> Cc: Christian Borntraeger <[email protected]> Cc: Catalin Marinas <[email protected]> Cc: Will Deacon <[email protected]> Cc: Rolf Eike Beer <[email protected]> Cc: David Hildenbrand <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2022-07-17mm: khugepaged: reorg some khugepaged helpersYang Shi2-14/+8
The khugepaged_{enabled|always|req_madv} are not khugepaged only anymore, move them to huge_mm.h and rename to hugepage_flags_xxx, and remove khugepaged_req_madv due to no users. Also move khugepaged_defrag to khugepaged.c since its only caller is in that file, it doesn't have to be in a header file. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Yang Shi <[email protected]> Reviewed-by: Zach O'Keefe <[email protected]> Cc: Kirill A. Shutemov <[email protected]> Cc: Matthew Wilcox <[email protected]> Cc: Miaohe Lin <[email protected]> Cc: Vlastimil Babka <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2022-07-17mm: thp: kill __transhuge_page_enabled()Yang Shi1-55/+2
The page fault path checks THP eligibility with __transhuge_page_enabled() which does the similar thing as hugepage_vma_check(), so use hugepage_vma_check() instead. However page fault allows DAX and !anon_vma cases, so added a new flag, in_pf, to hugepage_vma_check() to make page fault work correctly. The in_pf flag is also used to skip shmem and file THP for page fault since shmem handles THP in its own shmem_fault() and file THP allocation on fault is not supported yet. Also remove hugepage_vma_enabled() since hugepage_vma_check() is the only caller now, it is not necessary to have a helper function. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Yang Shi <[email protected]> Reviewed-by: Zach O'Keefe <[email protected]> Cc: Kirill A. Shutemov <[email protected]> Cc: Matthew Wilcox <[email protected]> Cc: Miaohe Lin <[email protected]> Cc: Vlastimil Babka <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2022-07-17mm: thp: kill transparent_hugepage_active()Yang Shi2-8/+10
The transparent_hugepage_active() was introduced to show THP eligibility bit in smaps in proc, smaps is the only user. But it actually does the similar check as hugepage_vma_check() which is used by khugepaged. We definitely don't have to maintain two similar checks, so kill transparent_hugepage_active(). This patch also fixed the wrong behavior for VM_NO_KHUGEPAGED vmas. Also move hugepage_vma_check() to huge_memory.c and huge_mm.h since it is not only for khugepaged anymore. [[email protected]: check vma->vm_mm, per Zach] [[email protected]: add comment to vdso check] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Yang Shi <[email protected]> Reviewed-by: Zach O'Keefe <[email protected]> Cc: Kirill A. Shutemov <[email protected]> Cc: Matthew Wilcox <[email protected]> Cc: Miaohe Lin <[email protected]> Cc: Vlastimil Babka <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2022-07-17mm: thp: consolidate vma size check to transhuge_vma_suitableYang Shi2-14/+11
There are couple of places that check whether the vma size is ok for THP or whether address fits, they are open coded and duplicate, use transhuge_vma_suitable() to do the job by passing in (vma->end - HPAGE_PMD_SIZE). Move vma size check into hugepage_vma_check(). This will make khugepaged_enter() is as same as khugepaged_enter_vma(). There is just one caller for khugepaged_enter(), replace it to khugepaged_enter_vma() and remove khugepaged_enter(). Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Yang Shi <[email protected]> Reviewed-by: Zach O'Keefe <[email protected]> Cc: Kirill A. Shutemov <[email protected]> Cc: Matthew Wilcox <[email protected]> Cc: Miaohe Lin <[email protected]> Cc: Vlastimil Babka <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2022-07-17fsdax: dedup file range to use a compare functionShiyang Ruan2-4/+16
With dax we cannot deal with readpage() etc. So, we create a dax comparison function which is similar with vfs_dedupe_file_range_compare(). And introduce dax_remap_file_range_prep() for filesystem use. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Goldwyn Rodrigues <[email protected]> Signed-off-by: Shiyang Ruan <[email protected]> Reviewed-by: Darrick J. Wong <[email protected]> Reviewed-by: Christoph Hellwig <[email protected]> Cc: Al Viro <[email protected]> Cc: Dan Williams <[email protected]> Cc: Dan Williams <[email protected]> Cc: Dave Chinner <[email protected]> Cc: Goldwyn Rodrigues <[email protected]> Cc: Jane Chu <[email protected]> Cc: Matthew Wilcox <[email protected]> Cc: Miaohe Lin <[email protected]> Cc: Naoya Horiguchi <[email protected]> Cc: Ritesh Harjani <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2022-07-17fsdax: set a CoW flag when associate reflink mappingsShiyang Ruan1-0/+6
Introduce a PAGE_MAPPING_DAX_COW flag to support association with CoW file mappings. In this case, since the dax-rmap has already took the responsibility to look up for shared files by given dax page, the page->mapping is no longer to used for rmap but for marking that this dax page is shared. And to make sure disassociation works fine, we use page->index as refcount, and clear page->mapping to the initial state when page->index is decreased to 0. With the help of this new flag, it is able to distinguish normal case and CoW case, and keep the warning in normal case. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Shiyang Ruan <[email protected]> Reviewed-by: Christoph Hellwig <[email protected]> Reviewed-by: Darrick J. Wong <[email protected]> Cc: Al Viro <[email protected]> Cc: Dan Williams <[email protected]> Cc: Dan Williams <[email protected]> Cc: Dave Chinner <[email protected]> Cc: Goldwyn Rodrigues <[email protected]> Cc: Goldwyn Rodrigues <[email protected]> Cc: Jane Chu <[email protected]> Cc: Matthew Wilcox <[email protected]> Cc: Miaohe Lin <[email protected]> Cc: Naoya Horiguchi <[email protected]> Cc: Ritesh Harjani <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2022-07-17mm: introduce mf_dax_kill_procs() for fsdax caseShiyang Ruan1-0/+2
This new function is a variant of mf_generic_kill_procs that accepts a file, offset pair instead of a struct to support multiple files sharing a DAX mapping. It is intended to be called by the file systems as part of the memory_failure handler after the file system performed a reverse mapping from the storage address to the file and file offset. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Shiyang Ruan <[email protected]> Reviewed-by: Dan Williams <[email protected]> Reviewed-by: Christoph Hellwig <[email protected]> Reviewed-by: Darrick J. Wong <[email protected]> Reviewed-by: Miaohe Lin <[email protected]> Cc: Al Viro <[email protected]> Cc: Dan Williams <[email protected]> Cc: Dave Chinner <[email protected]> Cc: Goldwyn Rodrigues <[email protected]> Cc: Goldwyn Rodrigues <[email protected]> Cc: Jane Chu <[email protected]> Cc: Matthew Wilcox <[email protected]> Cc: Naoya Horiguchi <[email protected]> Cc: Ritesh Harjani <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2022-07-17fsdax: introduce dax_lock_mapping_entry()Shiyang Ruan1-0/+15
The current dax_lock_page() locks dax entry by obtaining mapping and index in page. To support 1-to-N RMAP in NVDIMM, we need a new function to lock a specific dax entry corresponding to this file's mapping,index. And output the page corresponding to the specific dax entry for caller use. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Shiyang Ruan <[email protected]> Reviewed-by: Christoph Hellwig <[email protected]> Reviewed-by: Darrick J. Wong <[email protected]> Cc: Al Viro <[email protected]> Cc: Dan Williams <[email protected]> Cc: Dan Williams <[email protected]> Cc: Dave Chinner <[email protected]> Cc: Goldwyn Rodrigues <[email protected]> Cc: Goldwyn Rodrigues <[email protected]> Cc: Jane Chu <[email protected]> Cc: Matthew Wilcox <[email protected]> Cc: Miaohe Lin <[email protected]> Cc: Naoya Horiguchi <[email protected]> Cc: Ritesh Harjani <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2022-07-17pagemap,pmem: introduce ->memory_failure()Shiyang Ruan1-0/+12
When memory-failure occurs, we call this function which is implemented by each kind of devices. For the fsdax case, pmem device driver implements it. Pmem device driver will find out the filesystem in which the corrupted page located in. With dax_holder notify support, we are able to notify the memory failure from pmem driver to upper layers. If there is something not support in the notify routine, memory_failure will fall back to the generic hanlder. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Shiyang Ruan <[email protected]> Reviewed-by: Christoph Hellwig <[email protected]> Reviewed-by: Dan Williams <[email protected]> Reviewed-by: Darrick J. Wong <[email protected]> Reviewed-by: Naoya Horiguchi <[email protected]> Cc: Al Viro <[email protected]> Cc: Dan Williams <[email protected]> Cc: Dave Chinner <[email protected]> Cc: Goldwyn Rodrigues <[email protected]> Cc: Goldwyn Rodrigues <[email protected]> Cc: Jane Chu <[email protected]> Cc: Matthew Wilcox <[email protected]> Cc: Miaohe Lin <[email protected]> Cc: Ritesh Harjani <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2022-07-17dax: introduce holder for dax_deviceShiyang Ruan1-8/+25
Patch series "v14 fsdax-rmap + v11 fsdax-reflink", v2. The patchset fsdax-rmap is aimed to support shared pages tracking for fsdax. It moves owner tracking from dax_assocaite_entry() to pmem device driver, by introducing an interface ->memory_failure() for struct pagemap. This interface is called by memory_failure() in mm, and implemented by pmem device. Then call holder operations to find the filesystem which the corrupted data located in, and call filesystem handler to track files or metadata associated with this page. Finally we are able to try to fix the corrupted data in filesystem and do other necessary processing, such as killing processes who are using the files affected. The call trace is like this: memory_failure() |* fsdax case |------------ |pgmap->ops->memory_failure() => pmem_pgmap_memory_failure() | dax_holder_notify_failure() => | dax_device->holder_ops->notify_failure() => | - xfs_dax_notify_failure() | |* xfs_dax_notify_failure() | |-------------------------- | | xfs_rmap_query_range() | | xfs_dax_failure_fn() | | * corrupted on metadata | | try to recover data, call xfs_force_shutdown() | | * corrupted on file data | | try to recover data, call mf_dax_kill_procs() |* normal case |------------- |mf_generic_kill_procs() The patchset fsdax-reflink attempts to add CoW support for fsdax, and takes XFS, which has both reflink and fsdax features, as an example. One of the key mechanisms needed to be implemented in fsdax is CoW. Copy the data from srcmap before we actually write data to the destination iomap. And we just copy range in which data won't be changed. Another mechanism is range comparison. In page cache case, readpage() is used to load data on disk to page cache in order to be able to compare data. In fsdax case, readpage() does not work. So, we need another compare data with direct access support. With the two mechanisms implemented in fsdax, we are able to make reflink and fsdax work together in XFS. This patch (of 14): To easily track filesystem from a pmem device, we introduce a holder for dax_device structure, and also its operation. This holder is used to remember who is using this dax_device: - When it is the backend of a filesystem, the holder will be the instance of this filesystem. - When this pmem device is one of the targets in a mapped device, the holder will be this mapped device. In this case, the mapped device has its own dax_device and it will follow the first rule. So that we can finally track to the filesystem we needed. The holder and holder_ops will be set when filesystem is being mounted, or an target device is being activated. Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Shiyang Ruan <[email protected]> Reviewed-by: Christoph Hellwig <[email protected]> Reviewed-by: Dan Williams <[email protected]> Reviewed-by: Darrick J. Wong <[email protected]> Cc: Dave Chinner <[email protected]> Cc: Jane Chu <[email protected]> Cc: Goldwyn Rodrigues <[email protected]> Cc: Al Viro <[email protected]> Cc: Matthew Wilcox <[email protected]> Cc: Naoya Horiguchi <[email protected]> Cc: Miaohe Lin <[email protected]> Cc: Dan Williams <[email protected]> Cc: Goldwyn Rodrigues <[email protected]> Cc: Ritesh Harjani <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2022-07-17mm: add device coherent vma selection for memory migrationAlex Sierra1-0/+1
This case is used to migrate pages from device memory, back to system memory. Device coherent type memory is cache coherent from device and CPU point of view. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Alex Sierra <[email protected]> Signed-off-by: Christoph Hellwig <[email protected]> Acked-by: Felix Kuehling <[email protected]> Reviewed-by: Alistair Poppple <[email protected]> Reviewed-by: David Hildenbrand <[email protected]> Cc: Jason Gunthorpe <[email protected]> Cc: Jerome Glisse <[email protected]> Cc: Matthew Wilcox <[email protected]> Cc: Ralph Campbell <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2022-07-17mm: add zone device coherent type memory supportAlex Sierra2-1/+23
Device memory that is cache coherent from device and CPU point of view. This is used on platforms that have an advanced system bus (like CAPI or CXL). Any page of a process can be migrated to such memory. However, no one should be allowed to pin such memory so that it can always be evicted. [[email protected]: rebased ontop of the refcount changes, remove is_dev_private_or_coherent_page] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Alex Sierra <[email protected]> Signed-off-by: Christoph Hellwig <[email protected]> Acked-by: Felix Kuehling <[email protected]> Reviewed-by: Alistair Popple <[email protected]> Acked-by: David Hildenbrand <[email protected]> Cc: Jason Gunthorpe <[email protected]> Cc: Jerome Glisse <[email protected]> Cc: Matthew Wilcox <[email protected]> Cc: Ralph Campbell <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2022-07-17mm: move page zone helpers from mm.h to mmzone.hAlex Sierra3-79/+81
It makes more sense to have these helpers in zone specific header file, rather than the generic mm.h Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Alex Sierra <[email protected]> Cc: Alistair Popple <[email protected]> Cc: Christoph Hellwig <[email protected]> Cc: David Hildenbrand <[email protected]> Cc: Felix Kuehling <[email protected]> Cc: Jason Gunthorpe <[email protected]> Cc: Jerome Glisse <[email protected]> Cc: Matthew Wilcox <[email protected]> Cc: Ralph Campbell <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2022-07-17mm: rename is_pinnable_page() to is_longterm_pinnable_page()Alex Sierra1-4/+4
Patch series "Add MEMORY_DEVICE_COHERENT for coherent device memory mapping", v9. This patch series introduces MEMORY_DEVICE_COHERENT, a type of memory owned by a device that can be mapped into CPU page tables like MEMORY_DEVICE_GENERIC and can also be migrated like MEMORY_DEVICE_PRIVATE. This patch series is mostly self-contained except for a few places where it needs to update other subsystems to handle the new memory type. System stability and performance are not affected according to our ongoing testing, including xfstests. How it works: The system BIOS advertises the GPU device memory (aka VRAM) as SPM (special purpose memory) in the UEFI system address map. The amdgpu driver registers the memory with devmap as MEMORY_DEVICE_COHERENT using devm_memremap_pages. The initial user for this hardware page migration capability is the Frontier supercomputer project. This functionality is not AMD-specific. We expect other GPU vendors to find this functionality useful, and possibly other hardware types in the future. Our test nodes in the lab are similar to the Frontier configuration, with .5 TB of system memory plus 256 GB of device memory split across 4 GPUs, all in a single coherent address space. Page migration is expected to improve application efficiency significantly. We will report empirical results as they become available. Coherent device type pages at gup are now migrated back to system memory if they are being pinned long-term (FOLL_LONGTERM). The reason is, that long-term pinning would interfere with the device memory manager owning the device-coherent pages (e.g. evictions in TTM). These series incorporate Alistair Popple patches to do this migration from pin_user_pages() calls. hmm_gup_test has been added to hmm-test to test different get user pages calls. This series includes handling of device-managed anonymous pages returned by vm_normal_pages. Although they behave like normal pages for purposes of mapping in CPU page tables and for COW, they do not support LRU lists, NUMA migration or THP. We also introduced a FOLL_LRU flag that adds the same behaviour to follow_page and related APIs, to allow callers to specify that they expect to put pages on an LRU list. This patch (of 14): is_pinnable_page() and folio_is_pinnable() are renamed to is_longterm_pinnable_page() and folio_is_longterm_pinnable() respectively. These functions are used in the FOLL_LONGTERM flag context. Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Alex Sierra <[email protected]> Reviewed-by: David Hildenbrand <[email protected]> Cc: Jason Gunthorpe <[email protected]> Cc: Felix Kuehling <[email protected]> Cc: Ralph Campbell <[email protected]> Cc: Christoph Hellwig <[email protected]> Cc: Jerome Glisse <[email protected]> Cc: Alistair Popple <[email protected]> Cc: Matthew Wilcox <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2022-07-17mm: Add PAGE_ALIGN_DOWN macroDavid Gow1-0/+3
This is just the same as PAGE_ALIGN(), but rounds the address down, not up. Suggested-by: Dmitry Vyukov <[email protected]> Signed-off-by: David Gow <[email protected]> Acked-by: Andrew Morton <[email protected]> Signed-off-by: Richard Weinberger <[email protected]>
2022-07-17net/mlx5: fs, allow flow table creation with a UIDMark Bloch1-0/+1
Add UID field to flow table attributes to allow creating flow tables with a non default (zero) uid. Signed-off-by: Mark Bloch <[email protected]> Reviewed-by: Alex Vesker <[email protected]> Signed-off-by: Saeed Mahameed <[email protected]>
2022-07-17net/mlx5: fs, expose flow table ID to usersMark Bloch1-0/+1
Expose the flow table ID to users. This will be used by downstream patches to allow creating steering rules that point to a flow table ID. Signed-off-by: Mark Bloch <[email protected]> Reviewed-by: Maor Gottlieb <[email protected]> Signed-off-by: Saeed Mahameed <[email protected]>
2022-07-17net/mlx5: Expose the ability to point to any UID from shared UIDMark Bloch1-2/+4
Expose shared_object_to_user_object_allowed, this capability means an object created with shared UID can point to any UID. Signed-off-by: Mark Bloch <[email protected]> Reviewed-by: Maor Gottlieb <[email protected]> Signed-off-by: Saeed Mahameed <[email protected]>
2022-07-17Merge tag 'input-for-v5.19-rc6' of ↵Linus Torvalds1-4/+7
git://git.kernel.org/pub/scm/linux/kernel/git/dtor/input Pull input fixes from Dmitry Torokhov: - fix Goodix driver to properly behave on the Aya Neo Next - some more sanity checks in usbtouchscreen driver - a tweak in wm97xx driver in preparation for remove() to return void - a clarification in input core regarding units of measurement for resolution on touch events. * tag 'input-for-v5.19-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/dtor/input: Input: document the units for resolution of size axes Input: goodix - call acpi_device_fix_up_power() in some cases Input: wm97xx - make .remove() obviously always return 0 Input: usbtouchscreen - add driver_info sanity check
2022-07-17KVM: arm64: vgic: Consolidate userspace access for base address settingMarc Zyngier1-1/+0
Align kvm_vgic_addr() with the rest of the code by moving the userspace accesses into it. kvm_vgic_addr() is also made static. Signed-off-by: Marc Zyngier <[email protected]>
2022-07-17KVM: arm64: vgic-v2: Add helper for legacy dist/cpuif base address settingMarc Zyngier1-0/+1
We carry a legacy interface to set the base addresses for GICv2. As this is currently plumbed into the same handling code as the modern interface, it limits the evolution we can make there. Add a helper dedicated to this handling, with a view of maybe removing this in the future. Signed-off-by: Marc Zyngier <[email protected]>
2022-07-17media: mc-entity: Add a new helper function to get a remote pad for a padLaurent Pinchart1-0/+18
The newly added media_entity_remote_source_pad_unique() helper function handles use cases where the entity has a link enabled uniqueness constraint covering all pads. There are use cases where the constraint covers a specific pad only. Add a new media_pad_remote_pad_unique() function to handle this. It operates as media_entity_remote_source_pad_unique(), but on a given pad instead of on the entity. Signed-off-by: Laurent Pinchart <[email protected]> Acked-by: Hans Verkuil <[email protected]> Acked-by: Sakari Ailus <[email protected]> Signed-off-by: Mauro Carvalho Chehab <[email protected]>
2022-07-17media: mc-entity: Add a new helper function to get a remote padLaurent Pinchart1-0/+46
The media_entity_remote_pad_first() helper function returns the first remote pad it finds connected to a given pad. Beside being possibly non-deterministic (as it stops at the first enabled link), the fact that it returns the first match makes it unsuitable for drivers that need to guarantee that a single link is enabled, for instance when an entity can process data from one of multiple sources at a time. For those use cases, add a new helper function, media_entity_remote_pad_unique(), that operates on an entity and returns a remote pad, with a guarantee that only one link is enabled. To ease its use in drivers, also add an inline wrapper that locates source pads specifically. A wrapper that locates sink pads can easily be added when needed. Signed-off-by: Laurent Pinchart <[email protected]> Acked-by: Hans Verkuil <[email protected]> Acked-by: Sakari Ailus <[email protected]> Signed-off-by: Mauro Carvalho Chehab <[email protected]>
2022-07-17media: mc-entity: Rename media_entity_remote_pad() to ↵Laurent Pinchart1-2/+2
media_pad_remote_pad_first() The media_entity_remote_pad() is misnamed, as it operates on a pad and not an entity. Rename it to media_pad_remote_pad_first() to clarify its behaviour. Signed-off-by: Laurent Pinchart <[email protected]> Reviewed-by: Hans Verkuil <[email protected]> Acked-by: Sakari Ailus <[email protected]> Signed-off-by: Mauro Carvalho Chehab <[email protected]>
2022-07-17media: v4l2-async: Add notifier operation to destroy asd instancesLaurent Pinchart1-0/+2
Drivers typically extend the v4l2_async_subdev structure by embedding it in a driver-specific structure, to store per-subdev custom data. The v4l2_async_subdev instances are freed by the v4l2-async framework, which makes this mechanism cumbersome to use safely when custom data needs special treatment to be destroyed (such as freeing additional memory, or releasing references to kernel objects). To ease this, add a .destroy() operation to the v4l2_async_notifier_operations structure. The operation is called right before the v4l2_async_subdev is freed, giving drivers a chance to destroy data if needed. Signed-off-by: Laurent Pinchart <[email protected]> Reviewed-by: Hans Verkuil <[email protected]> Acked-by: Sakari Ailus <[email protected]> Signed-off-by: Mauro Carvalho Chehab <[email protected]>
2022-07-17media: videobuf2: Introduce vb2_find_buffer()Ezequiel Garcia1-0/+10
All users of vb2_find_timestamp() combine it with vb2_get_buffer() to retrieve a videobuf2 buffer, given a u64 timestamp. Introduce an API for this use-case. Users will be converted to the new API as follow-up commits. Signed-off-by: Ezequiel Garcia <[email protected]> Acked-by: Tomasz Figa <[email protected]> Reviewed-by: Philipp Zabel <[email protected]> Signed-off-by: Hans Verkuil <[email protected]> Signed-off-by: Mauro Carvalho Chehab <[email protected]>
2022-07-17media: Add P010 tiled formatEzequiel Garcia1-0/+1
Add P010 tiled format [rebased, updated pixel format name and added description] Tested-by: Benjamin Gaignard <[email protected]> Signed-off-by: Ezequiel Garcia <[email protected]> Signed-off-by: Jernej Skrabec <[email protected]> Signed-off-by: Hans Verkuil <[email protected]> Signed-off-by: Mauro Carvalho Chehab <[email protected]>
2022-07-16Merge tag 'tty-5.19-rc7' of ↵Linus Torvalds2-1/+7
git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty Pull tty and serial driver fixes from Greg KH: "Here are some TTY and Serial driver fixes for 5.19-rc7. They resolve a number of reported problems including: - longtime bug in pty_write() that has been reported in the past. - 8250 driver fixes - new serial device ids - vt overlapping data copy bugfix - other tiny serial driver bugfixes All of these have been in linux-next for a while with no reported problems" * tag 'tty-5.19-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty: tty: use new tty_insert_flip_string_and_push_buffer() in pty_write() tty: extract tty_flip_buffer_commit() from tty_flip_buffer_push() serial: 8250: dw: Fix the macro RZN1_UART_xDMACR_8_WORD_BURST vt: fix memory overlapping when deleting chars in the buffer serial: mvebu-uart: correctly report configured baudrate value serial: 8250: Fix PM usage_count for console handover serial: 8250: fix return error code in serial8250_request_std_resource() serial: stm32: Clear prev values before setting RTS delays tty: Add N_CAN327 line discipline ID for ELM327 based CAN driver serial: 8250: Fix __stop_tx() & DMA Tx restart races serial: pl011: UPSTAT_AUTORTS requires .throttle/unthrottle tty: serial: samsung_tty: set dma burst_size to 1 serial: 8250: dw: enable using pdata with ACPI
2022-07-16iio: trigger: move trig->owner init to trigger allocate() stageDmitry Rokosov2-14/+16
To provide a new IIO trigger to the IIO core, usually driver executes the following pipeline: allocate()/register()/get(). Before, IIO core assigned trig->owner as a pointer to the module which registered this trigger at the register() stage. But actually the trigger object is owned by the module earlier, on the allocate() stage, when trigger object is successfully allocated for the driver. This patch moves trig->owner initialization from register() stage of trigger initialization pipeline to allocate() stage to eliminate all misunderstandings and time gaps between trigger object creation and owner acquiring. Signed-off-by: Dmitry Rokosov <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jonathan Cameron <[email protected]>
2022-07-16fs: remove no_llseekJason A. Donenfeld1-1/+1
Now that all callers of ->llseek are going through vfs_llseek(), we don't gain anything by keeping no_llseek around. Nothing actually calls it and setting ->llseek to no_lseek is completely equivalent to leaving it NULL. Longer term (== by the end of merge window) we want to remove all such intializations. To simplify the merge window this commit does *not* touch initializers - it only defines no_llseek as NULL (and simplifies the tests on file opening). At -rc1 we'll need do a mechanical removal of no_llseek - git grep -l -w no_llseek | grep -v porting.rst | while read i; do sed -i '/\<no_llseek\>/d' $i done would do it. Signed-off-by: Jason A. Donenfeld <[email protected]> Signed-off-by: Al Viro <[email protected]>
2022-07-16serial: remove VR41XX serial driverThomas Bogendoerfer1-4/+0
Commit d3164e2f3b0a ("MIPS: Remove VR41xx support") removed support for MIPS VR41xx platform, so remove exclusive drivers for this platform, too. Signed-off-by: Thomas Bogendoerfer <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Greg Kroah-Hartman <[email protected]>
2022-07-16Merge tag 'extcon-next-for-5.20' of ↵Greg Kroah-Hartman1-0/+2
git://git.kernel.org/pub/scm/linux/kernel/git/chanwoo/extcon into char-misc-next Chanwoo writes: Update extcon next for v5.20 Detailed description for this pull request: 1. Add new connector type of both EXTCON_DISP_CVBS and EXTCON_DISP_EDP - Add both EXTCON_DISP_CVBS for Composite Video Broadcast Signal[1] and EXTCON_DISP_EDP for Embedded Display Port[2]. [1] https://en.wikipedia.org/wiki/Composite_video [2] https://en.wikipedia.org/wiki/DisplayPort#eDP 2. Fix the minor issues of extcon provider driver - Drop unused remove function on extcon-fsa9480.c - Remove extraneous space before a debug message on extcon-palmas.c - Remove duplicate word in the comment - Drop useless mask_invert flag on irqchip on extcon-sm5502.c - Drop useless mask_invert flag on irqchip on extcon-rt8973a.c * tag 'extcon-next-for-5.20' of git://git.kernel.org/pub/scm/linux/kernel/git/chanwoo/extcon: extcon: Add EXTCON_DISP_CVBS and EXTCON_DISP_EDP extcon: rt8973a: Drop useless mask_invert flag on irqchip extcon: sm5502: Drop useless mask_invert flag on irqchip extcon: Drop unexpected word "the" in the comments extcon: Remove extraneous space before a debug message extcon: fsa9480: Drop no-op remove function
2022-07-16Merge tag 'icc-5.20-rc1-v2' of ↵Greg Kroah-Hartman3-0/+214
git://git.kernel.org/pub/scm/linux/kernel/git/djakov/icc into char-misc-next Georgi writes: interconnect changes for 5.20 Here are the interconnect changes for the 5.20-rc1 merge window consisting of two new drivers, misc driver improvements and new device managed API. Core change: - Add device managed bulk API Driver changes: - New driver for NXP i.MX8MP platforms - New driver for Qualcomm SM6350 platforms - Multiple bucket support for Qualcomm RPM-based drivers. Signed-off-by: Georgi Djakov <[email protected]> * tag 'icc-5.20-rc1-v2' of git://git.kernel.org/pub/scm/linux/kernel/git/djakov/icc: PM / devfreq: imx: Register i.MX8MP interconnect device interconnect: imx: Add platform driver for imx8mp interconnect: imx: configure NoC mode/prioriry/ext_control interconnect: imx: introduce imx_icc_provider interconnect: imx: set src node interconnect: imx: fix max_node_id interconnect: qcom: icc-rpm: Set bandwidth and clock for bucket values interconnect: qcom: icc-rpm: Support multiple buckets interconnect: qcom: icc-rpm: Change to use qcom_icc_xlate_extended() interconnect: qcom: Move qcom_icc_xlate_extended() to a common file dt-bindings: interconnect: Update property for icc-rpm path tag interconnect: icc-rpm: Set destination bandwidth as well as source bandwidth interconnect: qcom: msm8939: Use icc_sync_state interconnect: add device managed bulk API dt-bindings: interconnect: add fsl,imx8mp.h dt-bindings: interconnect: imx8m: Add bindings for imx8mp noc interconnect: qcom: Add SM6350 driver support dt-bindings: interconnect: Add Qualcomm SM6350 NoC support dt-bindings: interconnect: qcom: Split out rpmh-common bindings interconnect: qcom: icc-rpmh: Support child NoC device probe
2022-07-15net: ipv4: new arp_accept option to accept garp only if in-networkJaehee Park1-1/+1
In many deployments, we want the option to not learn a neighbor from garp if the src ip is not in the same subnet as an address configured on the interface that received the garp message. net.ipv4.arp_accept sysctl is currently used to control creation of a neigh from a received garp packet. This patch adds a new option '2' to net.ipv4.arp_accept which extends option '1' by including the subnet check. Signed-off-by: Jaehee Park <[email protected]> Suggested-by: Roopa Prabhu <[email protected]> Signed-off-by: Jakub Kicinski <[email protected]>
2022-07-15tcp/udp: Make early_demux back namespacified.Kuniyuki Iwashima3-6/+2
Commit e21145a9871a ("ipv4: namespacify ip_early_demux sysctl knob") made it possible to enable/disable early_demux on a per-netns basis. Then, we introduced two knobs, tcp_early_demux and udp_early_demux, to switch it for TCP/UDP in commit dddb64bcb346 ("net: Add sysctl to toggle early demux for tcp and udp"). However, the .proc_handler() was wrong and actually disabled us from changing the behaviour in each netns. We can execute early_demux if net.ipv4.ip_early_demux is on and each proto .early_demux() handler is not NULL. When we toggle (tcp|udp)_early_demux, the change itself is saved in each netns variable, but the .early_demux() handler is a global variable, so the handler is switched based on the init_net's sysctl variable. Thus, netns (tcp|udp)_early_demux knobs have nothing to do with the logic. Whether we CAN execute proto .early_demux() is always decided by init_net's sysctl knob, and whether we DO it or not is by each netns ip_early_demux knob. This patch namespacifies (tcp|udp)_early_demux again. For now, the users of the .early_demux() handler are TCP and UDP only, and they are called directly to avoid retpoline. So, we can remove the .early_demux() handler from inet6?_protos and need not dereference them in ip6?_rcv_finish_core(). If another proto needs .early_demux(), we can restore it at that time. Fixes: dddb64bcb346 ("net: Add sysctl to toggle early demux for tcp and udp") Signed-off-by: Kuniyuki Iwashima <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jakub Kicinski <[email protected]>
2022-07-15tracing/events: Add __vstring() and __assign_vstr() helper macrosSteven Rostedt (Google)6-0/+38
There's several places that open code the following logic: TP_STRUCT__entry(__dynamic_array(char, msg, MSG_MAX)), TP_fast_assign(vsnprintf(__get_str(msg), MSG_MAX, vaf->fmt, *vaf->va);) To load a string created by variable array va_list. The main issue with this approach is that "MSG_MAX" usage in the __dynamic_array() portion. That actually just reserves the MSG_MAX in the event, and even wastes space because there's dynamic meta data also saved in the event to denote the offset and size of the dynamic array. It would have been better to just use a static __array() field. Instead, create __vstring() and __assign_vstr() that work like __string and __assign_str() but instead of taking a destination string to copy, take a format string and a va_list pointer and fill in the values. It uses the helper: #define __trace_event_vstr_len(fmt, va) \ ({ \ va_list __ap; \ int __ret; \ \ va_copy(__ap, *(va)); \ __ret = vsnprintf(NULL, 0, fmt, __ap) + 1; \ va_end(__ap); \ \ min(__ret, TRACE_EVENT_STR_MAX); \ }) To figure out the length to store the string. It may be slightly slower as it needs to run the vsnprintf() twice, but it now saves space on the ring buffer. Link: https://lkml.kernel.org/r/[email protected] Cc: Dennis Dalessandro <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Andrew Morton <[email protected]> Cc: Jason Gunthorpe <[email protected]> Cc: Leon Romanovsky <[email protected]> Cc: Kalle Valo <[email protected]> Cc: "David S. Miller" <[email protected]> Cc: Eric Dumazet <[email protected]> Cc: Jakub Kicinski <[email protected]> Cc: Paolo Abeni <[email protected]> Cc: Arend van Spriel <[email protected]> Cc: Franky Lin <[email protected]> Cc: Hante Meuleman <[email protected]> Cc: Gregory Greenman <[email protected]> Cc: Peter Chen <[email protected]> Cc: Greg Kroah-Hartman <[email protected]> Cc: Mathias Nyman <[email protected]> Cc: Chunfeng Yun <[email protected]> Cc: Bin Liu <[email protected]> Cc: Marek Lindner <[email protected]> Cc: Simon Wunderlich <[email protected]> Cc: Antonio Quartulli <[email protected]> Cc: Sven Eckelmann <[email protected]> Cc: Johannes Berg <[email protected]> Cc: Jim Cromie <[email protected]> Signed-off-by: Steven Rostedt (Google) <[email protected]>
2022-07-15acl: make posix_acl_clone() available to overlayfsChristian Brauner1-0/+1
The ovl_get_acl() function needs to alter the POSIX ACLs retrieved from the lower filesystem. Instead of hand-rolling a overlayfs specific posix_acl_clone() variant allow export it. It's not special and it's not deeply internal anyway. Link: https://lore.kernel.org/r/[email protected] Cc: Seth Forshee <[email protected]> Cc: Amir Goldstein <[email protected]> Cc: Vivek Goyal <[email protected]> Cc: Christoph Hellwig <[email protected]> Cc: Aleksa Sarai <[email protected]> Cc: Miklos Szeredi <[email protected]> Cc: [email protected] Cc: [email protected] Reviewed-by: Seth Forshee <[email protected]> Signed-off-by: Christian Brauner (Microsoft) <[email protected]>
2022-07-15acl: move idmapped mount fixup into vfs_{g,s}etxattr()Christian Brauner2-13/+23
This cycle we added support for mounting overlayfs on top of idmapped mounts. Recently I've started looking into potential corner cases when trying to add additional tests and I noticed that reporting for POSIX ACLs is currently wrong when using idmapped layers with overlayfs mounted on top of it. I'm going to give a rather detailed explanation to both the origin of the problem and the solution. Let's assume the user creates the following directory layout and they have a rootfs /var/lib/lxc/c1/rootfs. The files in this rootfs are owned as you would expect files on your host system to be owned. For example, ~/.bashrc for your regular user would be owned by 1000:1000 and /root/.bashrc would be owned by 0:0. IOW, this is just regular boring filesystem tree on an ext4 or xfs filesystem. The user chooses to set POSIX ACLs using the setfacl binary granting the user with uid 4 read, write, and execute permissions for their .bashrc file: setfacl -m u:4:rwx /var/lib/lxc/c2/rootfs/home/ubuntu/.bashrc Now they to expose the whole rootfs to a container using an idmapped mount. So they first create: mkdir -pv /vol/contpool/{ctrover,merge,lowermap,overmap} mkdir -pv /vol/contpool/ctrover/{over,work} chown 10000000:10000000 /vol/contpool/ctrover/{over,work} The user now creates an idmapped mount for the rootfs: mount-idmapped/mount-idmapped --map-mount=b:0:10000000:65536 \ /var/lib/lxc/c2/rootfs \ /vol/contpool/lowermap This for example makes it so that /var/lib/lxc/c2/rootfs/home/ubuntu/.bashrc which is owned by uid and gid 1000 as being owned by uid and gid 10001000 at /vol/contpool/lowermap/home/ubuntu/.bashrc. Assume the user wants to expose these idmapped mounts through an overlayfs mount to a container. mount -t overlay overlay \ -o lowerdir=/vol/contpool/lowermap, \ upperdir=/vol/contpool/overmap/over, \ workdir=/vol/contpool/overmap/work \ /vol/contpool/merge The user can do this in two ways: (1) Mount overlayfs in the initial user namespace and expose it to the container. (2) Mount overlayfs on top of the idmapped mounts inside of the container's user namespace. Let's assume the user chooses the (1) option and mounts overlayfs on the host and then changes into a container which uses the idmapping 0:10000000:65536 which is the same used for the two idmapped mounts. Now the user tries to retrieve the POSIX ACLs using the getfacl command getfacl -n /vol/contpool/lowermap/home/ubuntu/.bashrc and to their surprise they see: # file: vol/contpool/merge/home/ubuntu/.bashrc # owner: 1000 # group: 1000 user::rw- user:4294967295:rwx group::r-- mask::rwx other::r-- indicating the the uid wasn't correctly translated according to the idmapped mount. The problem is how we currently translate POSIX ACLs. Let's inspect the callchain in this example: idmapped mount /vol/contpool/merge: 0:10000000:65536 caller's idmapping: 0:10000000:65536 overlayfs idmapping (ofs->creator_cred): 0:0:4k /* initial idmapping */ sys_getxattr() -> path_getxattr() -> getxattr() -> do_getxattr() |> vfs_getxattr() | -> __vfs_getxattr() | -> handler->get == ovl_posix_acl_xattr_get() | -> ovl_xattr_get() | -> vfs_getxattr() | -> __vfs_getxattr() | -> handler->get() /* lower filesystem callback */ |> posix_acl_fix_xattr_to_user() { 4 = make_kuid(&init_user_ns, 4); 4 = mapped_kuid_fs(&init_user_ns /* no idmapped mount */, 4); /* FAILURE */ -1 = from_kuid(0:10000000:65536 /* caller's idmapping */, 4); } If the user chooses to use option (2) and mounts overlayfs on top of idmapped mounts inside the container things don't look that much better: idmapped mount /vol/contpool/merge: 0:10000000:65536 caller's idmapping: 0:10000000:65536 overlayfs idmapping (ofs->creator_cred): 0:10000000:65536 sys_getxattr() -> path_getxattr() -> getxattr() -> do_getxattr() |> vfs_getxattr() | -> __vfs_getxattr() | -> handler->get == ovl_posix_acl_xattr_get() | -> ovl_xattr_get() | -> vfs_getxattr() | -> __vfs_getxattr() | -> handler->get() /* lower filesystem callback */ |> posix_acl_fix_xattr_to_user() { 4 = make_kuid(&init_user_ns, 4); 4 = mapped_kuid_fs(&init_user_ns, 4); /* FAILURE */ -1 = from_kuid(0:10000000:65536 /* caller's idmapping */, 4); } As is easily seen the problem arises because the idmapping of the lower mount isn't taken into account as all of this happens in do_gexattr(). But do_getxattr() is always called on an overlayfs mount and inode and thus cannot possible take the idmapping of the lower layers into account. This problem is similar for fscaps but there the translation happens as part of vfs_getxattr() already. Let's walk through an fscaps overlayfs callchain: setcap 'cap_net_raw+ep' /var/lib/lxc/c2/rootfs/home/ubuntu/.bashrc The expected outcome here is that we'll receive the cap_net_raw capability as we are able to map the uid associated with the fscap to 0 within our container. IOW, we want to see 0 as the result of the idmapping translations. If the user chooses option (1) we get the following callchain for fscaps: idmapped mount /vol/contpool/merge: 0:10000000:65536 caller's idmapping: 0:10000000:65536 overlayfs idmapping (ofs->creator_cred): 0:0:4k /* initial idmapping */ sys_getxattr() -> path_getxattr() -> getxattr() -> do_getxattr() -> vfs_getxattr() -> xattr_getsecurity() -> security_inode_getsecurity() ________________________________ -> cap_inode_getsecurity() | | { V | 10000000 = make_kuid(0:0:4k /* overlayfs idmapping */, 10000000); | 10000000 = mapped_kuid_fs(0:0:4k /* no idmapped mount */, 10000000); | /* Expected result is 0 and thus that we own the fscap. */ | 0 = from_kuid(0:10000000:65536 /* caller's idmapping */, 10000000); | } | -> vfs_getxattr_alloc() | -> handler->get == ovl_other_xattr_get() | -> vfs_getxattr() | -> xattr_getsecurity() | -> security_inode_getsecurity() | -> cap_inode_getsecurity() | { | 0 = make_kuid(0:0:4k /* lower s_user_ns */, 0); | 10000000 = mapped_kuid_fs(0:10000000:65536 /* idmapped mount */, 0); | 10000000 = from_kuid(0:0:4k /* overlayfs idmapping */, 10000000); | |____________________________________________________________________| } -> vfs_getxattr_alloc() -> handler->get == /* lower filesystem callback */ And if the user chooses option (2) we get: idmapped mount /vol/contpool/merge: 0:10000000:65536 caller's idmapping: 0:10000000:65536 overlayfs idmapping (ofs->creator_cred): 0:10000000:65536 sys_getxattr() -> path_getxattr() -> getxattr() -> do_getxattr() -> vfs_getxattr() -> xattr_getsecurity() -> security_inode_getsecurity() _______________________________ -> cap_inode_getsecurity() | | { V | 10000000 = make_kuid(0:10000000:65536 /* overlayfs idmapping */, 0); | 10000000 = mapped_kuid_fs(0:0:4k /* no idmapped mount */, 10000000); | /* Expected result is 0 and thus that we own the fscap. */ | 0 = from_kuid(0:10000000:65536 /* caller's idmapping */, 10000000); | } | -> vfs_getxattr_alloc() | -> handler->get == ovl_other_xattr_get() | |-> vfs_getxattr() | -> xattr_getsecurity() | -> security_inode_getsecurity() | -> cap_inode_getsecurity() | { | 0 = make_kuid(0:0:4k /* lower s_user_ns */, 0); | 10000000 = mapped_kuid_fs(0:10000000:65536 /* idmapped mount */, 0); | 0 = from_kuid(0:10000000:65536 /* overlayfs idmapping */, 10000000); | |____________________________________________________________________| } -> vfs_getxattr_alloc() -> handler->get == /* lower filesystem callback */ We can see how the translation happens correctly in those cases as the conversion happens within the vfs_getxattr() helper. For POSIX ACLs we need to do something similar. However, in contrast to fscaps we cannot apply the fix directly to the kernel internal posix acl data structure as this would alter the cached values and would also require a rework of how we currently deal with POSIX ACLs in general which almost never take the filesystem idmapping into account (the noteable exception being FUSE but even there the implementation is special) and instead retrieve the raw values based on the initial idmapping. The correct values are then generated right before returning to userspace. The fix for this is to move taking the mount's idmapping into account directly in vfs_getxattr() instead of having it be part of posix_acl_fix_xattr_to_user(). To this end we split out two small and unexported helpers posix_acl_getxattr_idmapped_mnt() and posix_acl_setxattr_idmapped_mnt(). The former to be called in vfs_getxattr() and the latter to be called in vfs_setxattr(). Let's go back to the original example. Assume the user chose option (1) and mounted overlayfs on top of idmapped mounts on the host: idmapped mount /vol/contpool/merge: 0:10000000:65536 caller's idmapping: 0:10000000:65536 overlayfs idmapping (ofs->creator_cred): 0:0:4k /* initial idmapping */ sys_getxattr() -> path_getxattr() -> getxattr() -> do_getxattr() |> vfs_getxattr() | |> __vfs_getxattr() | | -> handler->get == ovl_posix_acl_xattr_get() | | -> ovl_xattr_get() | | -> vfs_getxattr() | | |> __vfs_getxattr() | | | -> handler->get() /* lower filesystem callback */ | | |> posix_acl_getxattr_idmapped_mnt() | | { | | 4 = make_kuid(&init_user_ns, 4); | | 10000004 = mapped_kuid_fs(0:10000000:65536 /* lower idmapped mount */, 4); | | 10000004 = from_kuid(&init_user_ns, 10000004); | | |_______________________ | | } | | | | | |> posix_acl_getxattr_idmapped_mnt() | | { | | V | 10000004 = make_kuid(&init_user_ns, 10000004); | 10000004 = mapped_kuid_fs(&init_user_ns /* no idmapped mount */, 10000004); | 10000004 = from_kuid(&init_user_ns, 10000004); | } |_________________________________________________ | | | | |> posix_acl_fix_xattr_to_user() | { V 10000004 = make_kuid(0:0:4k /* init_user_ns */, 10000004); /* SUCCESS */ 4 = from_kuid(0:10000000:65536 /* caller's idmapping */, 10000004); } And similarly if the user chooses option (1) and mounted overayfs on top of idmapped mounts inside the container: idmapped mount /vol/contpool/merge: 0:10000000:65536 caller's idmapping: 0:10000000:65536 overlayfs idmapping (ofs->creator_cred): 0:10000000:65536 sys_getxattr() -> path_getxattr() -> getxattr() -> do_getxattr() |> vfs_getxattr() | |> __vfs_getxattr() | | -> handler->get == ovl_posix_acl_xattr_get() | | -> ovl_xattr_get() | | -> vfs_getxattr() | | |> __vfs_getxattr() | | | -> handler->get() /* lower filesystem callback */ | | |> posix_acl_getxattr_idmapped_mnt() | | { | | 4 = make_kuid(&init_user_ns, 4); | | 10000004 = mapped_kuid_fs(0:10000000:65536 /* lower idmapped mount */, 4); | | 10000004 = from_kuid(&init_user_ns, 10000004); | | |_______________________ | | } | | | | | |> posix_acl_getxattr_idmapped_mnt() | | { V | 10000004 = make_kuid(&init_user_ns, 10000004); | 10000004 = mapped_kuid_fs(&init_user_ns /* no idmapped mount */, 10000004); | 10000004 = from_kuid(0(&init_user_ns, 10000004); | |_________________________________________________ | } | | | |> posix_acl_fix_xattr_to_user() | { V 10000004 = make_kuid(0:0:4k /* init_user_ns */, 10000004); /* SUCCESS */ 4 = from_kuid(0:10000000:65536 /* caller's idmappings */, 10000004); } The last remaining problem we need to fix here is ovl_get_acl(). During ovl_permission() overlayfs will call: ovl_permission() -> generic_permission() -> acl_permission_check() -> check_acl() -> get_acl() -> inode->i_op->get_acl() == ovl_get_acl() > get_acl() /* on the underlying filesystem) ->inode->i_op->get_acl() == /*lower filesystem callback */ -> posix_acl_permission() passing through the get_acl request to the underlying filesystem. This will retrieve the acls stored in the lower filesystem without taking the idmapping of the underlying mount into account as this would mean altering the cached values for the lower filesystem. So we block using ACLs for now until we decided on a nice way to fix this. Note this limitation both in the documentation and in the code. The most straightforward solution would be to have ovl_get_acl() simply duplicate the ACLs, update the values according to the idmapped mount and return it to acl_permission_check() so it can be used in posix_acl_permission() forgetting them afterwards. This is a bit heavy handed but fairly straightforward otherwise. Link: https://github.com/brauner/mount-idmapped/issues/9 Link: https://lore.kernel.org/r/[email protected] Cc: Seth Forshee <[email protected]> Cc: Amir Goldstein <[email protected]> Cc: Vivek Goyal <[email protected]> Cc: Christoph Hellwig <[email protected]> Cc: Aleksa Sarai <[email protected]> Cc: Miklos Szeredi <[email protected]> Cc: [email protected] Cc: [email protected] Reviewed-by: Seth Forshee <[email protected]> Signed-off-by: Christian Brauner (Microsoft) <[email protected]>
2022-07-15mnt_idmapping: add vfs[g,u]id_into_k[g,u]id()Christian Brauner1-0/+26
Add two tiny helpers to conver a vfsuid into a kuid. Signed-off-by: Christian Brauner (Microsoft) <[email protected]>
2022-07-15Merge tag 'ovl-fixes-5.19-rc7' of ↵Christian Brauner26-93/+140
ssh://gitolite.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs into fs.idmapped.overlay.acl Bring in Miklos' tree which contains the temporary fix for POSIX ACLs with overlayfs on top of idmapped layers. We will add a proper fix on top of it and then revert the temporary fix. Cc: Seth Forshee <[email protected]> Cc: Miklos Szeredi <[email protected]> Signed-off-by: Christian Brauner (Microsoft) <[email protected]>
2022-07-15security: Add LSM hook to setgroups() syscallMicah Morton3-0/+15
Give the LSM framework the ability to filter setgroups() syscalls. There are already analagous hooks for the set*uid() and set*gid() syscalls. The SafeSetID LSM will use this new hook to ensure setgroups() calls are allowed by the installed security policy. Tested by putting print statement in security_task_fix_setgroups() hook and confirming that it gets hit when userspace does a setgroups() syscall. Acked-by: Casey Schaufler <[email protected]> Reviewed-by: Serge Hallyn <[email protected]> Signed-off-by: Micah Morton <[email protected]>
2022-07-15neighbor: tracing: Have neigh_create event use __string()Steven Rostedt (Google)1-1/+1
The dev field of the neigh_create event uses __dynamic_array() with a fixed size, which defeats the purpose of __dynamic_array(). Looking at the logic, as it already uses __assign_str(), just use the same logic in __string to create the size needed. It appears that because "dev" can be NULL, it needs the check. But __string() can have the same checks as __assign_str() so use them there too. Link: https://lkml.kernel.org/r/[email protected] Cc: David Ahern <[email protected]> Cc: David S. Miller <[email protected]> Cc: [email protected] Acked-by: Jakub Kicinski <[email protected]> Reviewed-by: David Ahern <[email protected]> Signed-off-by: Steven Rostedt (Google) <[email protected]>
2022-07-15tracing/ipv4/ipv6: Use static array for name field in fib*_lookup_table eventSteven Rostedt (Google)2-7/+7
The fib_lookup_table and fib6_lookup_table events declare name as a dynamic_array, but also give it a fixed size, which defeats the purpose of the dynamic array, especially since the dynamic array also includes meta data in the event to specify its size. Since the size of the name is at most 16 bytes (defined by IFNAMSIZ), it is not worth spending the effort to determine the size of the string. Just use a fixed size array and copy into it. This will save 4 bytes that are used for the meta data that saves the size and position of a dynamic array, and even slightly speed up the event processing. Link: https://lkml.kernel.org/r/[email protected] Cc: David S. Miller <[email protected]> Cc: [email protected] Acked-by: Jakub Kicinski <[email protected]> Reviewed-by: David Ahern <[email protected]> Signed-off-by: Steven Rostedt (Google) <[email protected]>
2022-07-15Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvmLinus Torvalds2-1/+11
Pull KVM fixes from Paolo Bonzini: "RISC-V: - Fix missing PAGE_PFN_MASK - Fix SRCU deadlock caused by kvm_riscv_check_vcpu_requests() x86: - Fix for nested virtualization when TSC scaling is active - Estimate the size of fastcc subroutines conservatively, avoiding disastrous underestimation when return thunks are enabled - Avoid possible use of uninitialized fields of 'struct kvm_lapic_irq' Generic: - Mark as such the boolean values available from the statistics file descriptors - Clarify statistics documentation" * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: KVM: emulate: do not adjust size of fastop and setcc subroutines KVM: x86: Fully initialize 'struct kvm_lapic_irq' in kvm_pv_kick_cpu_op() Documentation: kvm: clarify histogram units kvm: stats: tell userspace which values are boolean x86/kvm: fix FASTOP_SIZE when return thunks are enabled KVM: nVMX: Always enable TSC scaling for L2 when it was enabled for L1 RISC-V: KVM: Fix SRCU deadlock caused by kvm_riscv_check_vcpu_requests() riscv: Fix missing PAGE_PFN_MASK
2022-07-15Merge tag 'ceph-for-5.19-rc7' of https://github.com/ceph/ceph-clientLinus Torvalds1-1/+1
Pull ceph fix from Ilya Dryomov: "A folio locking fixup that Xiubo and David cooperated on, marked for stable. Most of it is in netfs but I picked it up into ceph tree on agreement with David" * tag 'ceph-for-5.19-rc7' of https://github.com/ceph/ceph-client: netfs: do not unlock and put the folio twice
2022-07-15firmware: arm_scmi: Get detailed power scale from perfLukasz Luba1-1/+7
In SCMI v3.1 the power scale can be in micro-Watts. The upper layers, e.g. cpufreq and EM should handle received power values properly (upscale when needed). Thus, provide an interface which allows to check what is the scale for power values. The old interface allowed to distinguish between bogo-Watts and milli-Watts only (which was good for older SCMI spec). Acked-by: Sudeep Holla <[email protected]> Signed-off-by: Lukasz Luba <[email protected]> Signed-off-by: Rafael J. Wysocki <[email protected]>
2022-07-15PM: EM: convert power field to micro-Watts precision and align driversLukasz Luba1-16/+38
The milli-Watts precision causes rounding errors while calculating efficiency cost for each OPP. This is especially visible in the 'simple' Energy Model (EM), where the power for each OPP is provided from OPP framework. This can cause some OPPs to be marked inefficient, while using micro-Watts precision that might not happen. Update all EM users which access 'power' field and assume the value is in milli-Watts. Solve also an issue with potential overflow in calculation of energy estimation on 32bit machine. It's needed now since the power value (thus the 'cost' as well) are higher. Example calculation which shows the rounding error and impact: power = 'dyn-power-coeff' * volt_mV * volt_mV * freq_MHz power_a_uW = (100 * 600mW * 600mW * 500MHz) / 10^6 = 18000 power_a_mW = (100 * 600mW * 600mW * 500MHz) / 10^9 = 18 power_b_uW = (100 * 605mW * 605mW * 600MHz) / 10^6 = 21961 power_b_mW = (100 * 605mW * 605mW * 600MHz) / 10^9 = 21 max_freq = 2000MHz cost_a_mW = 18 * 2000MHz/500MHz = 72 cost_a_uW = 18000 * 2000MHz/500MHz = 72000 cost_b_mW = 21 * 2000MHz/600MHz = 70 // <- artificially better cost_b_uW = 21961 * 2000MHz/600MHz = 73203 The 'cost_b_mW' (which is based on old milli-Watts) is misleadingly better that the 'cost_b_uW' (this patch uses micro-Watts) and such would have impact on the 'inefficient OPPs' information in the Cpufreq framework. This patch set removes the rounding issue. Signed-off-by: Lukasz Luba <[email protected]> Acked-by: Daniel Lezcano <[email protected]> Acked-by: Viresh Kumar <[email protected]> Signed-off-by: Rafael J. Wysocki <[email protected]>
2022-07-15Merge tag 'soc-fixes-5.19-3' of ↵Linus Torvalds1-1/+1
git://git.kernel.org/pub/scm/linux/kernel/git/soc/soc Pull ARM SoC fixes from Arnd Bergmann: "Most of the contents are bugfixes for the devicetree files: - A Qualcomm MSM8974 pin controller regression, caused by a cleanup patch that gets partially reverted here. - Missing properties for Broadcom BCM49xx to fix timer detection and SMP boot. - Fix touchscreen pinctrl for imx6ull-colibri board - Multiple fixes for Rockchip rk3399 based machines including the vdu clock-rate fix, otg port fix on Quartz64-A and ethernet on Quartz64-B - Fixes for misspelled DT contents causing minor problems on imx6qdl-ts7970m, orangepi-zero, sama5d2, kontron-kswitch-d10, and ls1028a And a couple of changes elsewhere: - Fix binding for Allwinner D1 display pipeline - Trivial code fixes to the TEE and reset controller driver subsystems and the rockchip platform code. - Multiple updates to the MAINTAINERS files, marking the Palm Treo support as orphaned, and fixing some entries for added or changed file names" * tag 'soc-fixes-5.19-3' of git://git.kernel.org/pub/scm/linux/kernel/git/soc/soc: (21 commits) arm64: dts: broadcom: bcm4908: Fix cpu node for smp boot arm64: dts: broadcom: bcm4908: Fix timer node for BCM4906 SoC ARM: dts: sunxi: Fix SPI NOR campatible on Orange Pi Zero ARM: dts: at91: sama5d2: Fix typo in i2s1 node tee: tee_get_drvdata(): fix description of return value optee: Remove duplicate 'of' in two places. ARM: dts: kswitch-d10: use open drain mode for coma-mode pins ARM: dts: colibri-imx6ull: fix snvs pinmux group optee: smc_abi.c: fix wrong pointer passed to IS_ERR/PTR_ERR() MAINTAINERS: add polarfire rng, pci and clock drivers MAINTAINERS: mark ARM/PALM TREO SUPPORT orphan ARM: dts: imx6qdl-ts7970: Fix ngpio typo and count arm64: dts: ls1028a: Update SFP node to include clock dt-bindings: display: sun4i: Fix D1 pipeline count ARM: dts: qcom: msm8974: re-add missing pinctrl reset: Fix devm bulk optional exclusive control getter MAINTAINERS: rectify entry for SYNOPSYS AXS10x RESET CONTROLLER DRIVER ARM: rockchip: Add missing of_node_put() in rockchip_suspend_init() arm64: dts: rockchip: Assign RK3399 VDU clock rate arm64: dts: rockchip: Fix Quartz64-A dwc3 otg port behavior ...