aboutsummaryrefslogtreecommitdiff
path: root/drivers/iommu/intel
AgeCommit message (Collapse)AuthorFilesLines
2023-04-13iommu/vt-d: Move iopf code from SVA to IOPF enabling pathLu Baolu1-14/+18
Generally enabling IOMMU_DEV_FEAT_SVA requires IOMMU_DEV_FEAT_IOPF, but some devices manage I/O Page Faults themselves instead of relying on the IOMMU. Move IOPF related code from SVA to IOPF enabling path. For the device drivers that relies on the IOMMU for IOPF through PCI/PRI, IOMMU_DEV_FEAT_IOPF must be enabled before and disabled after IOMMU_DEV_FEAT_SVA. Reviewed-by: Kevin Tian <[email protected]> Signed-off-by: Lu Baolu <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Joerg Roedel <[email protected]>
2023-04-13iommu/vt-d: Allow SVA with device-specific IOPFLu Baolu1-1/+15
Currently enabling SVA requires IOPF support from the IOMMU and device PCI PRI. However, some devices can handle IOPF by itself without ever sending PCI page requests nor advertising PRI capability. Allow SVA support with IOPF handled either by IOMMU (PCI PRI) or device driver (device-specific IOPF). As long as IOPF could be handled, SVA should continue to work. Reviewed-by: Kevin Tian <[email protected]> Signed-off-by: Lu Baolu <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Joerg Roedel <[email protected]>
2023-03-31iommu/vt-d: Fix an IOMMU perfmon warning when CPU hotplugKan Liang2-24/+46
A warning can be triggered when hotplug CPU 0. $ echo 0 > /sys/devices/system/cpu/cpu0/online ------------[ cut here ]------------ Voluntary context switch within RCU read-side critical section! WARNING: CPU: 0 PID: 19 at kernel/rcu/tree_plugin.h:318 rcu_note_context_switch+0x4f4/0x580 RIP: 0010:rcu_note_context_switch+0x4f4/0x580 Call Trace: <TASK> ? perf_event_update_userpage+0x104/0x150 __schedule+0x8d/0x960 ? perf_event_set_state.part.82+0x11/0x50 schedule+0x44/0xb0 schedule_timeout+0x226/0x310 ? __perf_event_disable+0x64/0x1a0 ? _raw_spin_unlock+0x14/0x30 wait_for_completion+0x94/0x130 __wait_rcu_gp+0x108/0x130 synchronize_rcu+0x67/0x70 ? invoke_rcu_core+0xb0/0xb0 ? __bpf_trace_rcu_stall_warning+0x10/0x10 perf_pmu_migrate_context+0x121/0x370 iommu_pmu_cpu_offline+0x6a/0xa0 ? iommu_pmu_del+0x1e0/0x1e0 cpuhp_invoke_callback+0x129/0x510 cpuhp_thread_fun+0x94/0x150 smpboot_thread_fn+0x183/0x220 ? sort_range+0x20/0x20 kthread+0xe6/0x110 ? kthread_complete_and_exit+0x20/0x20 ret_from_fork+0x1f/0x30 </TASK> ---[ end trace 0000000000000000 ]--- The synchronize_rcu() will be invoked in the perf_pmu_migrate_context(), when migrating a PMU to a new CPU. However, the current for_each_iommu() is within RCU read-side critical section. Two methods were considered to fix the issue. - Use the dmar_global_lock to replace the RCU read lock when going through the drhd list. But it triggers a lockdep warning. - Use the cpuhp_setup_state_multi() to set up a dedicated state for each IOMMU PMU. The lock can be avoided. The latter method is implemented in this patch. Since each IOMMU PMU has a dedicated state, add cpuhp_node and cpu in struct iommu_pmu to track the state. The state can be dynamically allocated now. Remove the CPUHP_AP_PERF_X86_IOMMU_PERF_ONLINE. Fixes: 46284c6ceb5e ("iommu/vt-d: Support cpumask for IOMMU perfmon") Reported-by: Ammy Yi <[email protected]> Signed-off-by: Kan Liang <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Lu Baolu <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Joerg Roedel <[email protected]>
2023-03-31iommu/vt-d: Allow zero SAGAW if second-stage not supportedLu Baolu1-1/+2
The VT-d spec states (in section 11.4.2) that hardware implementations reporting second-stage translation support (SSTS) field as Clear also report the SAGAW field as 0. Fix an inappropriate check in alloc_iommu(). Fixes: 792fb43ce2c9 ("iommu/vt-d: Enable Intel IOMMU scalable mode by default") Suggested-by: Raghunathan Srinivasan <[email protected]> Reviewed-by: Kevin Tian <[email protected]> Signed-off-by: Jacob Pan <[email protected]> Signed-off-by: Lu Baolu <[email protected]> Link: https://lore.kernel.org/r/[email protected] Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Joerg Roedel <[email protected]>
2023-03-31iommu/vt-d: Remove unnecessary locking in intel_irq_remapping_alloc()Lu Baolu1-6/+0
The global rwsem dmar_global_lock was introduced by commit 3a5670e8ac932 ("iommu/vt-d: Introduce a rwsem to protect global data structures"). It is used to protect DMAR related global data from DMAR hotplug operations. Using dmar_global_lock in intel_irq_remapping_alloc() is unnecessary as the DMAR global data structures are not touched there. Remove it to avoid below lockdep warning. ====================================================== WARNING: possible circular locking dependency detected 6.3.0-rc2 #468 Not tainted ------------------------------------------------------ swapper/0/1 is trying to acquire lock: ff1db4cb40178698 (&domain->mutex){+.+.}-{3:3}, at: __irq_domain_alloc_irqs+0x3b/0xa0 but task is already holding lock: ffffffffa0c1cdf0 (dmar_global_lock){++++}-{3:3}, at: intel_iommu_init+0x58e/0x880 which lock already depends on the new lock. the existing dependency chain (in reverse order) is: -> #1 (dmar_global_lock){++++}-{3:3}: lock_acquire+0xd6/0x320 down_read+0x42/0x180 intel_irq_remapping_alloc+0xad/0x750 mp_irqdomain_alloc+0xb8/0x2b0 irq_domain_alloc_irqs_locked+0x12f/0x2d0 __irq_domain_alloc_irqs+0x56/0xa0 alloc_isa_irq_from_domain.isra.7+0xa0/0xe0 mp_map_pin_to_irq+0x1dc/0x330 setup_IO_APIC+0x128/0x210 apic_intr_mode_init+0x67/0x110 x86_late_time_init+0x24/0x40 start_kernel+0x41e/0x7e0 secondary_startup_64_no_verify+0xe0/0xeb -> #0 (&domain->mutex){+.+.}-{3:3}: check_prevs_add+0x160/0xef0 __lock_acquire+0x147d/0x1950 lock_acquire+0xd6/0x320 __mutex_lock+0x9c/0xfc0 __irq_domain_alloc_irqs+0x3b/0xa0 dmar_alloc_hwirq+0x9e/0x120 iommu_pmu_register+0x11d/0x200 intel_iommu_init+0x5de/0x880 pci_iommu_init+0x12/0x40 do_one_initcall+0x65/0x350 kernel_init_freeable+0x3ca/0x610 kernel_init+0x1a/0x140 ret_from_fork+0x29/0x50 other info that might help us debug this: Possible unsafe locking scenario: CPU0 CPU1 ---- ---- lock(dmar_global_lock); lock(&domain->mutex); lock(dmar_global_lock); lock(&domain->mutex); *** DEADLOCK *** Fixes: 9dbb8e3452ab ("irqdomain: Switch to per-domain locking") Reviewed-by: Jacob Pan <[email protected]> Tested-by: Jason Gunthorpe <[email protected]> Signed-off-by: Lu Baolu <[email protected]> Link: https://lore.kernel.org/r/[email protected] Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Joerg Roedel <[email protected]>
2023-03-31iommu: Remove ioasid infrastructureJason Gunthorpe2-2/+0
This has no use anymore, delete it all. Reviewed-by: Kevin Tian <[email protected]> Reviewed-by: Lu Baolu <[email protected]> Signed-off-by: Jason Gunthorpe <[email protected]> Signed-off-by: Jacob Pan <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Joerg Roedel <[email protected]>
2023-03-31iommu/ioasid: Rename INVALID_IOASIDJacob Pan3-4/+4
INVALID_IOASID and IOMMU_PASID_INVALID are duplicated. Rename INVALID_IOASID and consolidate since we are moving away from IOASID infrastructure. Reviewed-by: Dave Jiang <[email protected]> Reviewed-by: Jason Gunthorpe <[email protected]> Reviewed-by: Kevin Tian <[email protected]> Reviewed-by: Lu Baolu <[email protected]> Signed-off-by: Jacob Pan <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Joerg Roedel <[email protected]>
2023-03-31iommu/vt-d: Remove virtual command interfaceJacob Pan4-91/+0
Virtual command interface was introduced to allow using host PASIDs inside VMs. It is unused and abandoned due to architectural change. With this patch, we can safely remove this feature and the related helpers. Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jacob Pan <[email protected]> Reviewed-by: Kevin Tian <[email protected]> Reviewed-by: Jason Gunthorpe <[email protected]> Reviewed-by: Lu Baolu <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Joerg Roedel <[email protected]>
2023-03-22iommu: Use sysfs_emit() for sysfs showLu Baolu1-8/+9
Use sysfs_emit() instead of the sprintf() for sysfs entries. sysfs_emit() knows the maximum of the temporary buffer used for outputting sysfs content and avoids overrunning the buffer length. Prefer 'long long' over 'long long int' as suggested by checkpatch.pl. Signed-off-by: Lu Baolu <[email protected]> Reviewed-by: Jason Gunthorpe <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Joerg Roedel <[email protected]>
2023-02-24Merge tag 'for-linus-iommufd' of ↵Linus Torvalds2-3/+2
git://git.kernel.org/pub/scm/linux/kernel/git/jgg/iommufd Pull iommufd updates from Jason Gunthorpe: "Some polishing and small fixes for iommufd: - Remove IOMMU_CAP_INTR_REMAP, instead rely on the interrupt subsystem - Use GFP_KERNEL_ACCOUNT inside the iommu_domains - Support VFIO_NOIOMMU mode with iommufd - Various typos - A list corruption bug if HWPTs are used for attach" * tag 'for-linus-iommufd' of git://git.kernel.org/pub/scm/linux/kernel/git/jgg/iommufd: iommufd: Do not add the same hwpt to the ioas->hwpt_list twice iommufd: Make sure to zero vfio_iommu_type1_info before copying to user vfio: Support VFIO_NOIOMMU with iommufd iommufd: Add three missing structures in ucmd_buffer selftests: iommu: Fix test_cmd_destroy_access() call in user_copy iommu: Remove IOMMU_CAP_INTR_REMAP irq/s390: Add arch_is_isolated_msi() for s390 iommu/x86: Replace IOMMU_CAP_INTR_REMAP with IRQ_DOMAIN_FLAG_ISOLATED_MSI genirq/msi: Rename IRQ_DOMAIN_MSI_REMAP to IRQ_DOMAIN_ISOLATED_MSI genirq/irqdomain: Remove unused irq_domain_check_msi_remap() code iommufd: Convert to msi_device_has_isolated_msi() vfio/type1: Convert to iommu_group_has_isolated_msi() iommu: Add iommu_group_has_isolated_msi() genirq/msi: Add msi_device_has_isolated_msi()
2023-02-18Merge branches 'apple/dart', 'arm/exynos', 'arm/renesas', 'arm/smmu', ↵Joerg Roedel9-89/+1244
'x86/vt-d', 'x86/amd' and 'core' into next
2023-02-16iommu/vt-d: Allow to use flush-queue when first level is defaultTina Zhang1-1/+2
Commit 29b32839725f ("iommu/vt-d: Do not use flush-queue when caching-mode is on") forced default domains to be strict mode as long as IOMMU caching-mode is flagged. The reason for doing this is that when vIOMMU uses VT-d caching mode to synchronize shadowing page tables, the strict mode shows better performance. However, this optimization is orthogonal to the first-level page table because the Intel VT-d architecture does not define the caching mode of the first-level page table. Refer to VT-d spec, section 6.1, "When the CM field is reported as Set, any software updates to remapping structures other than first-stage mapping (including updates to not- present entries or present entries whose programming resulted in translation faults) requires explicit invalidation of the caches." Exclude the first-level page table from this optimization. Generally using first-stage translation in vIOMMU implies nested translation enabled in the physical IOMMU. In this case the first-stage page table is wholly captured by the guest. The vIOMMU only needs to transfer the cache invalidations on vIOMMU to the physical IOMMU. Forcing the default domain to strict mode will cause more frequent cache invalidations, resulting in performance degradation. In a real performance benchmark test measured by iperf receive, the performance result on Sapphire Rapids 100Gb NIC shows: w/ this fix ~51 Gbits/s, w/o this fix ~39.3 Gbits/s. Theoretically a first-stage IOMMU page table can still be shadowed in absence of the caching mode, e.g. with host write-protecting guest IOMMU page table to synchronize changed PTEs with the physical IOMMU page table. In this case the shadowing overhead is decoupled from emulating IOTLB invalidation then the overhead of the latter part is solely decided by the frequency of IOTLB invalidations. Hence allowing guest default dma domain to be lazy can also benefit the overall performance by reducing the total VM-exit numbers. Fixes: 29b32839725f ("iommu/vt-d: Do not use flush-queue when caching-mode is on") Reported-by: Sanjay Kumar <[email protected]> Suggested-by: Sanjay Kumar <[email protected]> Signed-off-by: Tina Zhang <[email protected]> Reviewed-by: Kevin Tian <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Lu Baolu <[email protected]> Signed-off-by: Joerg Roedel <[email protected]>
2023-02-16iommu/vt-d: Fix PASID directory pointer coherencyJacob Pan1-0/+7
On platforms that do not support IOMMU Extended capability bit 0 Page-walk Coherency, CPU caches are not snooped when IOMMU is accessing any translation structures. IOMMU access goes only directly to memory. Intel IOMMU code was missing a flush for the PASID table directory that resulted in the unrecoverable fault as shown below. This patch adds clflush calls whenever allocating and updating a PASID table directory to ensure cache coherency. On the reverse direction, there's no need to clflush the PASID directory pointer when we deactivate a context entry in that IOMMU hardware will not see the old PASID directory pointer after we clear the context entry. PASID directory entries are also never freed once allocated. DMAR: DRHD: handling fault status reg 3 DMAR: [DMA Read NO_PASID] Request device [00:0d.2] fault addr 0x1026a4000 [fault reason 0x51] SM: Present bit in Directory Entry is clear DMAR: Dump dmar1 table entries for IOVA 0x1026a4000 DMAR: scalable mode root entry: hi 0x0000000102448001, low 0x0000000101b3e001 DMAR: context entry: hi 0x0000000000000000, low 0x0000000101b4d401 DMAR: pasid dir entry: 0x0000000101b4e001 DMAR: pasid table entry[0]: 0x0000000000000109 DMAR: pasid table entry[1]: 0x0000000000000001 DMAR: pasid table entry[2]: 0x0000000000000000 DMAR: pasid table entry[3]: 0x0000000000000000 DMAR: pasid table entry[4]: 0x0000000000000000 DMAR: pasid table entry[5]: 0x0000000000000000 DMAR: pasid table entry[6]: 0x0000000000000000 DMAR: pasid table entry[7]: 0x0000000000000000 DMAR: PTE not present at level 4 Cc: <[email protected]> Fixes: 0bbeb01a4faf ("iommu/vt-d: Manage scalalble mode PASID tables") Reviewed-by: Kevin Tian <[email protected]> Reported-by: Sukumar Ghorai <[email protected]> Signed-off-by: Ashok Raj <[email protected]> Signed-off-by: Jacob Pan <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Lu Baolu <[email protected]> Signed-off-by: Joerg Roedel <[email protected]>
2023-02-16iommu/vt-d: Avoid superfluous IOTLB tracking in lazy modeJacob Pan1-1/+6
Intel IOMMU driver implements IOTLB flush queue with domain selective or PASID selective invalidations. In this case there's no need to track IOVA page range and sync IOTLBs, which may cause significant performance hit. This patch adds a check to avoid IOVA gather page and IOTLB sync for the lazy path. The performance difference on Sapphire Rapids 100Gb NIC is improved by the following (as measured by iperf send): w/o this fix~48 Gbits/s. with this fix ~54 Gbits/s Cc: <[email protected]> Fixes: 2a2b8eaa5b25 ("iommu: Handle freelists when using deferred flushing in iommu drivers") Reviewed-by: Robin Murphy <[email protected]> Reviewed-by: Kevin Tian <[email protected]> Tested-by: Sanjay Kumar <[email protected]> Signed-off-by: Sanjay Kumar <[email protected]> Signed-off-by: Jacob Pan <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Lu Baolu <[email protected]> Signed-off-by: Joerg Roedel <[email protected]>
2023-02-16iommu/vt-d: Fix error handling in sva enable/disable pathsLu Baolu1-4/+12
Roll back all previous actions in error paths of intel_iommu_enable_sva() and intel_iommu_disable_sva(). Fixes: d5b9e4bfe0d8 ("iommu/vt-d: Report prq to io-pgfault framework") Reviewed-by: Kevin Tian <[email protected]> Signed-off-by: Lu Baolu <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Joerg Roedel <[email protected]>
2023-02-03iommu/vt-d: Enable IOMMU perfmon supportKan Liang2-0/+6
Register and enable an IOMMU perfmon for each active IOMMU device. The failure of IOMMU perfmon registration doesn't impact other functionalities of an IOMMU device. Signed-off-by: Kan Liang <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Lu Baolu <[email protected]> Signed-off-by: Joerg Roedel <[email protected]>
2023-02-03iommu/vt-d: Add IOMMU perfmon overflow handler supportKan Liang4-2/+95
While enabled to count events and an event occurrence causes the counter value to increment and roll over to or past zero, this is termed a counter overflow. The overflow can trigger an interrupt. The IOMMU perfmon needs to handle the case properly. New HW IRQs are allocated for each IOMMU device for perfmon. The IRQ IDs are after the SVM range. In the overflow handler, the counter is not frozen. It's very unlikely that the same counter overflows again during the period. But it's possible that other counters overflow at the same time. Read the overflow register at the end of the handler and check whether there are more. Signed-off-by: Kan Liang <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Lu Baolu <[email protected]> Signed-off-by: Joerg Roedel <[email protected]>
2023-02-03iommu/vt-d: Support cpumask for IOMMU perfmonKan Liang1-8/+105
The perf subsystem assumes that all counters are by default per-CPU. So the user space tool reads a counter from each CPU. However, the IOMMU counters are system-wide and can be read from any CPU. Here we use a CPU mask to restrict counting to one CPU to handle the issue. (with CPU hotplug notifier to choose a different CPU if the chosen one is taken off-line). The CPU is exposed to /sys/bus/event_source/devices/dmar*/cpumask for the user space perf tool. Signed-off-by: Kan Liang <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Lu Baolu <[email protected]> Signed-off-by: Joerg Roedel <[email protected]>
2023-02-03iommu/vt-d: Add IOMMU perfmon supportKan Liang3-0/+565
Implement the IOMMU performance monitor capability, which supports the collection of information about key events occurring during operation of the remapping hardware, to aid performance tuning and debug. The IOMMU perfmon support is implemented as part of the IOMMU driver and interfaces with the Linux perf subsystem. The IOMMU PMU has the following unique features compared with the other PMUs. - Support counting. Not support sampling. - Does not support per-thread counting. The scope is system-wide. - Support per-counter capability register. The event constraints can be enumerated. - The available event and event group can also be enumerated. - Extra Enhanced Commands are introduced to control the counters. Add a new variable, struct iommu_pmu *pmu, to in the struct intel_iommu to track the PMU related information. Add iommu_pmu_register() and iommu_pmu_unregister() to register and unregister a IOMMU PMU. The register function setup the IOMMU PMU ops and invoke the standard perf_pmu_register() interface to register a PMU in the perf subsystem. This patch only exposes the functions. The following patch will enable them in the IOMMU driver. The IOMMU PMUs can be found under /sys/bus/event_source/devices/dmar* The available filters and event format can be found at the format folder $ ls /sys/bus/event_source/devices/dmar1/format/ event event_group filter_ats filter_ats_en filter_page_table filter_page_table_en The supported events can be found at the events folder $ ls /sys/bus/event_source/devices/dmar1/events/ ats_blocked fs_nonleaf_hit int_cache_hit_posted iommu_mem_blocked iotlb_hit pasid_cache_lookup ss_nonleaf_hit ctxt_cache_hit fs_nonleaf_lookup int_cache_lookup iommu_mrds iotlb_lookup pg_req_posted ss_nonleaf_lookup ctxt_cache_lookup int_cache_hit_nonposted iommu_clocks iommu_requests pasid_cache_hit pw_occupancy The command below illustrates filter usage with a simple example. $ perf stat -e dmar1/iommu_requests,filter_ats_en=0x1,filter_ats=0x1/ -a sleep 1 Performance counter stats for 'system wide': 368,947 dmar1/iommu_requests,filter_ats_en=0x1,filter_ats=0x1/ 1.002592074 seconds time elapsed Signed-off-by: Kan Liang <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Lu Baolu <[email protected]> Signed-off-by: Joerg Roedel <[email protected]>
2023-02-03iommu/vt-d: Support Enhanced Command InterfaceKan Liang3-0/+99
The Enhanced Command Register is to submit command and operand of enhanced commands to DMA Remapping hardware. It can supports up to 256 enhanced commands. There is a HW register to indicate the availability of all 256 enhanced commands. Each bit stands for each command. But there isn't an existing interface to read/write all 256 bits. Introduce the u64 ecmdcap[4] to store the existence of each enhanced command. Read 4 times to get all of them in map_iommu(). Add a helper to facilitate an enhanced command launch. Make sure hardware complete the command. Also add a helper to facilitate the check of PMU essentials. These helpers will be used later. Signed-off-by: Kan Liang <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Lu Baolu <[email protected]> Signed-off-by: Joerg Roedel <[email protected]>
2023-02-03iommu/vt-d: Retrieve IOMMU perfmon capability informationKan Liang6-1/+273
The performance monitoring infrastructure, perfmon, is to support collection of information about key events occurring during operation of the remapping hardware, to aid performance tuning and debug. Each remapping hardware unit has capability registers that indicate support for performance monitoring features and enumerate the capabilities. Add alloc_iommu_pmu() to retrieve IOMMU perfmon capability information for each iommu unit. The information is stored in the iommu->pmu data structure. Capability registers are read-only, so it's safe to prefetch and store them in the pmu structure. This could avoid unnecessary VMEXIT when this code is running in the virtualization environment. Add free_iommu_pmu() to free the saved capability information when freeing the iommu unit. Add a kernel config option for the IOMMU perfmon feature. Unless a user explicitly uses the perf tool to monitor the IOMMU perfmon event, there isn't any impact for the existing IOMMU. Enable it by default. Signed-off-by: Kan Liang <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Lu Baolu <[email protected]> Signed-off-by: Joerg Roedel <[email protected]>
2023-02-03iommu/vt-d: Support size of the register set in DRHDKan Liang1-4/+7
A new field, which indicates the size of the remapping hardware register set for this remapping unit, is introduced in the DMA-remapping hardware unit definition (DRHD) structure with the VT-d Spec 4.0. With this information, SW doesn't need to 'guess' the size of the register set anymore. Update the struct acpi_dmar_hardware_unit to reflect the field. Store the size of the register set in struct dmar_drhd_unit for each dmar device. The 'size' information is ResvZ for the old BIOS and platforms. Fall back to the old guessing method. There is nothing changed. Signed-off-by: Kan Liang <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Lu Baolu <[email protected]> Signed-off-by: Joerg Roedel <[email protected]>
2023-02-03iommu/vt-d: Set No Execute Enable bit in PASID table entryLu Baolu1-0/+11
Setup No Execute Enable bit (Bit 133) of a scalable mode PASID entry. This is to allow the use of XD bit of the first level page table. Fixes: ddf09b6d43ec ("iommu/vt-d: Setup pasid entries for iova over first level") Signed-off-by: Ashok Raj <[email protected]> Signed-off-by: Lu Baolu <[email protected]> Reviewed-by: Kevin Tian <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Joerg Roedel <[email protected]>
2023-02-03iommu/vt-d: Remove sva from intel_svm_devLu Baolu2-15/+9
After commit be51b1d6bbff ("iommu/sva: Refactoring iommu_sva_bind/unbind_device()"), the iommu driver doesn't need to return an iommu_sva pointer anymore. This removes the sva field from intel_svm_dev and cleanups the code accordingly. Reviewed-by: Kevin Tian <[email protected]> Signed-off-by: Lu Baolu <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Joerg Roedel <[email protected]>
2023-02-03iommu/vt-d: Remove users from intel_svm_devLu Baolu2-36/+27
It was used as a reference counter of an existing bond between device and user application memory address. Commit be51b1d6bbff ("iommu/sva: Refactoring iommu_sva_bind/unbind_device()") has added this in iommu core. Remove it to avoid duplicate code. Reviewed-by: Kevin Tian <[email protected]> Signed-off-by: Lu Baolu <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Joerg Roedel <[email protected]>
2023-02-03iommu/vt-d: Remove unused fields in svm structuresLu Baolu2-6/+0
They aren't used anywhere. Remove them to avoid dead code. Reviewed-by: Kevin Tian <[email protected]> Signed-off-by: Lu Baolu <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Joerg Roedel <[email protected]>
2023-02-03iommu/vt-d: Remove include/linux/intel-svm.hLu Baolu3-2/+5
There's no need to have a public header for Intel SVA implementation. The device driver should interact with Intel SVA implementation via the IOMMU generic APIs. Reviewed-by: Kevin Tian <[email protected]> Signed-off-by: Lu Baolu <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Joerg Roedel <[email protected]>
2023-01-30Merge branch 'iommu-memory-accounting' of ↵Jason Gunthorpe3-17/+23
ssh://gitolite.kernel.org/pub/scm/linux/kernel/git/joro/iommu intoiommufd/for-next Jason Gunthorpe says: ==================== iommufd follows the same design as KVM and uses memory cgroups to limit the amount of kernel memory a iommufd file descriptor can pin down. The various internal data structures already use GFP_KERNEL_ACCOUNT to charge its own memory. However, one of the biggest consumers of kernel memory is the IOPTEs stored under the iommu_domain and these allocations are not tracked. This series is the first step in fixing it. The iommu driver contract already includes a 'gfp' argument to the map_pages op, allowing iommufd to specify GFP_KERNEL_ACCOUNT and then having the driver allocate the IOPTE tables with that flag will capture a significant amount of the allocations. Update the iommu_map() API to pass in the GFP argument, and fix all call sites. Replace iommu_map_atomic(). Audit the "enterprise" iommu drivers to make sure they do the right thing. Intel and S390 ignore the GFP argument and always use GFP_ATOMIC. This is problematic for iommufd anyhow, so fix it. AMD and ARM SMMUv2/3 are already correct. A follow up series will be needed to capture the allocations made when the iommu_domain itself is allocated, which will complete the job. ==================== * 'iommu-memory-accounting' of ssh://gitolite.kernel.org/pub/scm/linux/kernel/git/joro/iommu: iommu/s390: Use GFP_KERNEL in sleepable contexts iommu/s390: Push the gfp parameter to the kmem_cache_alloc()'s iommu/intel: Use GFP_KERNEL in sleepable contexts iommu/intel: Support the gfp argument to the map_pages op iommu/intel: Add a gfp parameter to alloc_pgtable_page() iommufd: Use GFP_KERNEL_ACCOUNT for iommu_map() iommu/dma: Use the gfp parameter in __iommu_dma_alloc_noncontiguous() iommu: Add a gfp parameter to iommu_map_sg() iommu: Remove iommu_map_atomic() iommu: Add a gfp parameter to iommu_map() Link: https://lore.kernel.org/linux-iommu/[email protected] Signed-off-by: Jason Gunthorpe <[email protected]>
2023-01-25iommu/intel: Use GFP_KERNEL in sleepable contextsJason Gunthorpe1-2/+2
These contexts are sleepable, so use the proper annotation. The GFP_ATOMIC was added mechanically in the prior patches. Reviewed-by: Lu Baolu <[email protected]> Reviewed-by: Kevin Tian <[email protected]> Signed-off-by: Jason Gunthorpe <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Joerg Roedel <[email protected]>
2023-01-25iommu/intel: Support the gfp argument to the map_pages opJason Gunthorpe1-9/+15
Flow it down to alloc_pgtable_page() via pfn_to_dma_pte() and __domain_mapping(). Reviewed-by: Kevin Tian <[email protected]> Reviewed-by: Lu Baolu <[email protected]> Signed-off-by: Jason Gunthorpe <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Joerg Roedel <[email protected]>
2023-01-25iommu/intel: Add a gfp parameter to alloc_pgtable_page()Jason Gunthorpe3-9/+9
This is eventually called by iommufd through intel_iommu_map_pages() and it should not be forced to atomic. Push the GFP_ATOMIC to all callers. Reviewed-by: Kevin Tian <[email protected]> Reviewed-by: Lu Baolu <[email protected]> Signed-off-by: Jason Gunthorpe <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Joerg Roedel <[email protected]>
2023-01-11iommu/x86: Replace IOMMU_CAP_INTR_REMAP with IRQ_DOMAIN_FLAG_ISOLATED_MSIJason Gunthorpe2-3/+2
On x86 platforms when the HW can support interrupt remapping the iommu driver creates an irq_domain for the IR hardware and creates a child MSI irq_domain. When the global irq_remapping_enabled is set, the IR MSI domain is assigned to the PCI devices (by intel_irq_remap_add_device(), or amd_iommu_set_pci_msi_domain()) making those devices have the isolated MSI property. Due to how interrupt domains work, setting IRQ_DOMAIN_FLAG_ISOLATED_MSI on the parent IR domain will cause all struct devices attached to it to return true from msi_device_has_isolated_msi(). This replaces the IOMMU_CAP_INTR_REMAP flag as all places using IOMMU_CAP_INTR_REMAP also call msi_device_has_isolated_msi() Set the flag and delete the cap. Link: https://lore.kernel.org/r/[email protected] Tested-by: Matthew Rosato <[email protected]> Reviewed-by: Kevin Tian <[email protected]> Reviewed-by: Lu Baolu <[email protected]> Acked-by: Thomas Gleixner <[email protected]> Signed-off-by: Jason Gunthorpe <[email protected]>
2022-12-19Merge tag 'iommu-updates-v6.2' of ↵Linus Torvalds2-94/+90
git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu Pull iommu updates from Joerg Roedel: "Core code: - map/unmap_pages() cleanup - SVA and IOPF refactoring - Clean up and document return codes from device/domain attachment AMD driver: - Rework and extend parsing code for ivrs_ioapic, ivrs_hpet and ivrs_acpihid command line options - Some smaller cleanups Intel driver: - Blocking domain support - Cleanups S390 driver: - Fixes and improvements for attach and aperture handling PAMU driver: - Resource leak fix and cleanup Rockchip driver: - Page table permission bit fix Mediatek driver: - Improve safety from invalid dts input - Smaller fixes and improvements Exynos driver: - Fix driver initialization sequence Sun50i driver: - Remove IOMMU_DOMAIN_IDENTITY as it has not been working forever - Various other fixes" * tag 'iommu-updates-v6.2' of git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu: (74 commits) iommu/mediatek: Fix forever loop in error handling iommu/mediatek: Fix crash on isr after kexec() iommu/sun50i: Remove IOMMU_DOMAIN_IDENTITY iommu/amd: Fix typo in macro parameter name iommu/mediatek: Remove unused "mapping" member from mtk_iommu_data iommu/mediatek: Improve safety for mediatek,smi property in larb nodes iommu/mediatek: Validate number of phandles associated with "mediatek,larbs" iommu/mediatek: Add error path for loop of mm_dts_parse iommu/mediatek: Use component_match_add iommu/mediatek: Add platform_device_put for recovering the device refcnt iommu/fsl_pamu: Fix resource leak in fsl_pamu_probe() iommu/vt-d: Use real field for indication of first level iommu/vt-d: Remove unnecessary domain_context_mapped() iommu/vt-d: Rename domain_add_dev_info() iommu/vt-d: Rename iommu_disable_dev_iotlb() iommu/vt-d: Add blocking domain support iommu/vt-d: Add device_block_translation() helper iommu/vt-d: Allocate pasid table in device probe path iommu/amd: Check return value of mmu_notifier_register() iommu/amd: Fix pci device refcount leak in ppr_notifier() ...
2022-12-17Merge tag 'x86_mm_for_6.2_v2' of ↵Linus Torvalds1-8/+5
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 mm updates from Dave Hansen: "New Feature: - Randomize the per-cpu entry areas Cleanups: - Have CR3_ADDR_MASK use PHYSICAL_PAGE_MASK instead of open coding it - Move to "native" set_memory_rox() helper - Clean up pmd_get_atomic() and i386-PAE - Remove some unused page table size macros" * tag 'x86_mm_for_6.2_v2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (35 commits) x86/mm: Ensure forced page table splitting x86/kasan: Populate shadow for shared chunk of the CPU entry area x86/kasan: Add helpers to align shadow addresses up and down x86/kasan: Rename local CPU_ENTRY_AREA variables to shorten names x86/mm: Populate KASAN shadow for entire per-CPU range of CPU entry area x86/mm: Recompute physical address for every page of per-CPU CEA mapping x86/mm: Rename __change_page_attr_set_clr(.checkalias) x86/mm: Inhibit _PAGE_NX changes from cpa_process_alias() x86/mm: Untangle __change_page_attr_set_clr(.checkalias) x86/mm: Add a few comments x86/mm: Fix CR3_ADDR_MASK x86/mm: Remove P*D_PAGE_MASK and P*D_PAGE_SIZE macros mm: Convert __HAVE_ARCH_P..P_GET to the new style mm: Remove pointless barrier() after pmdp_get_lockless() x86/mm/pae: Get rid of set_64bit() x86_64: Remove pointless set_64bit() usage x86/mm/pae: Be consistent with pXXp_get_and_clear() x86/mm/pae: Use WRITE_ONCE() x86/mm/pae: Don't (ab)use atomic64 mm/gup: Fix the lockless PMD access ...
2022-12-15x86_64: Remove pointless set_64bit() usagePeter Zijlstra1-8/+5
The use of set_64bit() in X86_64 only code is pretty pointless, seeing how it's a direct assignment. Remove all this nonsense. [nathanchance: unbreak irte] Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Link: https://lkml.kernel.org/r/20221022114425.168036718%40infradead.org
2022-12-14Merge tag 'for-linus-iommufd' of ↵Linus Torvalds5-112/+120
git://git.kernel.org/pub/scm/linux/kernel/git/jgg/iommufd Pull iommufd implementation from Jason Gunthorpe: "iommufd is the user API to control the IOMMU subsystem as it relates to managing IO page tables that point at user space memory. It takes over from drivers/vfio/vfio_iommu_type1.c (aka the VFIO container) which is the VFIO specific interface for a similar idea. We see a broad need for extended features, some being highly IOMMU device specific: - Binding iommu_domain's to PASID/SSID - Userspace IO page tables, for ARM, x86 and S390 - Kernel bypassed invalidation of user page tables - Re-use of the KVM page table in the IOMMU - Dirty page tracking in the IOMMU - Runtime Increase/Decrease of IOPTE size - PRI support with faults resolved in userspace Many of these HW features exist to support VM use cases - for instance the combination of PASID, PRI and Userspace IO Page Tables allows an implementation of DMA Shared Virtual Addressing (vSVA) within a guest. Dirty tracking enables VM live migration with SRIOV devices and PASID support allow creating "scalable IOV" devices, among other things. As these features are fundamental to a VM platform they need to be uniformly exposed to all the driver families that do DMA into VMs, which is currently VFIO and VDPA" For more background, see the extended explanations in Jason's pull request: https://lore.kernel.org/lkml/[email protected]/ * tag 'for-linus-iommufd' of git://git.kernel.org/pub/scm/linux/kernel/git/jgg/iommufd: (62 commits) iommufd: Change the order of MSI setup iommufd: Improve a few unclear bits of code iommufd: Fix comment typos vfio: Move vfio group specific code into group.c vfio: Refactor dma APIs for emulated devices vfio: Wrap vfio group module init/clean code into helpers vfio: Refactor vfio_device open and close vfio: Make vfio_device_open() truly device specific vfio: Swap order of vfio_device_container_register() and open_device() vfio: Set device->group in helper function vfio: Create wrappers for group register/unregister vfio: Move the sanity check of the group to vfio_create_group() vfio: Simplify vfio_create_group() iommufd: Allow iommufd to supply /dev/vfio/vfio vfio: Make vfio_container optionally compiled vfio: Move container related MODULE_ALIAS statements into container.c vfio-iommufd: Support iommufd for emulated VFIO devices vfio-iommufd: Support iommufd for physical VFIO devices vfio-iommufd: Allow iommufd to be used in place of a container fd vfio: Use IOMMU_CAP_ENFORCE_CACHE_COHERENCY for vfio_file_enforced_coherent() ...
2022-12-12Merge tag 'irq-core-2022-12-10' of ↵Linus Torvalds2-26/+27
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull irq updates from Thomas Gleixner: "Updates for the interrupt core and driver subsystem: The bulk is the rework of the MSI subsystem to support per device MSI interrupt domains. This solves conceptual problems of the current PCI/MSI design which are in the way of providing support for PCI/MSI[-X] and the upcoming PCI/IMS mechanism on the same device. IMS (Interrupt Message Store] is a new specification which allows device manufactures to provide implementation defined storage for MSI messages (as opposed to PCI/MSI and PCI/MSI-X that has a specified message store which is uniform accross all devices). The PCI/MSI[-X] uniformity allowed us to get away with "global" PCI/MSI domains. IMS not only allows to overcome the size limitations of the MSI-X table, but also gives the device manufacturer the freedom to store the message in arbitrary places, even in host memory which is shared with the device. There have been several attempts to glue this into the current MSI code, but after lengthy discussions it turned out that there is a fundamental design problem in the current PCI/MSI-X implementation. This needs some historical background. When PCI/MSI[-X] support was added around 2003, interrupt management was completely different from what we have today in the actively developed architectures. Interrupt management was completely architecture specific and while there were attempts to create common infrastructure the commonalities were rudimentary and just providing shared data structures and interfaces so that drivers could be written in an architecture agnostic way. The initial PCI/MSI[-X] support obviously plugged into this model which resulted in some basic shared infrastructure in the PCI core code for setting up MSI descriptors, which are a pure software construct for holding data relevant for a particular MSI interrupt, but the actual association to Linux interrupts was completely architecture specific. This model is still supported today to keep museum architectures and notorious stragglers alive. In 2013 Intel tried to add support for hot-pluggable IO/APICs to the kernel, which was creating yet another architecture specific mechanism and resulted in an unholy mess on top of the existing horrors of x86 interrupt handling. The x86 interrupt management code was already an incomprehensible maze of indirections between the CPU vector management, interrupt remapping and the actual IO/APIC and PCI/MSI[-X] implementation. At roughly the same time ARM struggled with the ever growing SoC specific extensions which were glued on top of the architected GIC interrupt controller. This resulted in a fundamental redesign of interrupt management and provided the today prevailing concept of hierarchical interrupt domains. This allowed to disentangle the interactions between x86 vector domain and interrupt remapping and also allowed ARM to handle the zoo of SoC specific interrupt components in a sane way. The concept of hierarchical interrupt domains aims to encapsulate the functionality of particular IP blocks which are involved in interrupt delivery so that they become extensible and pluggable. The X86 encapsulation looks like this: |--- device 1 [Vector]---[Remapping]---[PCI/MSI]--|... |--- device N where the remapping domain is an optional component and in case that it is not available the PCI/MSI[-X] domains have the vector domain as their parent. This reduced the required interaction between the domains pretty much to the initialization phase where it is obviously required to establish the proper parent relation ship in the components of the hierarchy. While in most cases the model is strictly representing the chain of IP blocks and abstracting them so they can be plugged together to form a hierarchy, the design stopped short on PCI/MSI[-X]. Looking at the hardware it's clear that the actual PCI/MSI[-X] interrupt controller is not a global entity, but strict a per PCI device entity. Here we took a short cut on the hierarchical model and went for the easy solution of providing "global" PCI/MSI domains which was possible because the PCI/MSI[-X] handling is uniform across the devices. This also allowed to keep the existing PCI/MSI[-X] infrastructure mostly unchanged which in turn made it simple to keep the existing architecture specific management alive. A similar problem was created in the ARM world with support for IP block specific message storage. Instead of going all the way to stack a IP block specific domain on top of the generic MSI domain this ended in a construct which provides a "global" platform MSI domain which allows overriding the irq_write_msi_msg() callback per allocation. In course of the lengthy discussions we identified other abuse of the MSI infrastructure in wireless drivers, NTB etc. where support for implementation specific message storage was just mindlessly glued into the existing infrastructure. Some of this just works by chance on particular platforms but will fail in hard to diagnose ways when the driver is used on platforms where the underlying MSI interrupt management code does not expect the creative abuse. Another shortcoming of today's PCI/MSI-X support is the inability to allocate or free individual vectors after the initial enablement of MSI-X. This results in an works by chance implementation of VFIO (PCI pass-through) where interrupts on the host side are not set up upfront to avoid resource exhaustion. They are expanded at run-time when the guest actually tries to use them. The way how this is implemented is that the host disables MSI-X and then re-enables it with a larger number of vectors again. That works by chance because most device drivers set up all interrupts before the device actually will utilize them. But that's not universally true because some drivers allocate a large enough number of vectors but do not utilize them until it's actually required, e.g. for acceleration support. But at that point other interrupts of the device might be in active use and the MSI-X disable/enable dance can just result in losing interrupts and therefore hard to diagnose subtle problems. Last but not least the "global" PCI/MSI-X domain approach prevents to utilize PCI/MSI[-X] and PCI/IMS on the same device due to the fact that IMS is not longer providing a uniform storage and configuration model. The solution to this is to implement the missing step and switch from global PCI/MSI domains to per device PCI/MSI domains. The resulting hierarchy then looks like this: |--- [PCI/MSI] device 1 [Vector]---[Remapping]---|... |--- [PCI/MSI] device N which in turn allows to provide support for multiple domains per device: |--- [PCI/MSI] device 1 |--- [PCI/IMS] device 1 [Vector]---[Remapping]---|... |--- [PCI/MSI] device N |--- [PCI/IMS] device N This work converts the MSI and PCI/MSI core and the x86 interrupt domains to the new model, provides new interfaces for post-enable allocation/free of MSI-X interrupts and the base framework for PCI/IMS. PCI/IMS has been verified with the work in progress IDXD driver. There is work in progress to convert ARM over which will replace the platform MSI train-wreck. The cleanup of VFIO, NTB and other creative "solutions" are in the works as well. Drivers: - Updates for the LoongArch interrupt chip drivers - Support for MTK CIRQv2 - The usual small fixes and updates all over the place" * tag 'irq-core-2022-12-10' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (134 commits) irqchip/ti-sci-inta: Fix kernel doc irqchip/gic-v2m: Mark a few functions __init irqchip/gic-v2m: Include arm-gic-common.h irqchip/irq-mvebu-icu: Fix works by chance pointer assignment iommu/amd: Enable PCI/IMS iommu/vt-d: Enable PCI/IMS x86/apic/msi: Enable PCI/IMS PCI/MSI: Provide pci_ims_alloc/free_irq() PCI/MSI: Provide IMS (Interrupt Message Store) support genirq/msi: Provide constants for PCI/IMS support x86/apic/msi: Enable MSI_FLAG_PCI_MSIX_ALLOC_DYN PCI/MSI: Provide post-enable dynamic allocation interfaces for MSI-X PCI/MSI: Provide prepare_desc() MSI domain op PCI/MSI: Split MSI-X descriptor setup genirq/msi: Provide MSI_FLAG_MSIX_ALLOC_DYN genirq/msi: Provide msi_domain_alloc_irq_at() genirq/msi: Provide msi_domain_ops:: Prepare_desc() genirq/msi: Provide msi_desc:: Msi_data genirq/msi: Provide struct msi_map x86/apic/msi: Remove arch_create_remap_msi_irq_domain() ...
2022-12-12Merge branches 'arm/allwinner', 'arm/exynos', 'arm/mediatek', ↵Joerg Roedel5-201/+199
'arm/rockchip', 'arm/smmu', 'ppc/pamu', 's390', 'x86/vt-d', 'x86/amd' and 'core' into next
2022-12-05iommu/vt-d: Enable PCI/IMSThomas Gleixner1-3/+16
PCI/IMS works like PCI/MSI-X in the remapping. Just add the feature flag, but only when on real hardware. Virtualized IOMMUs need additional support, e.g. for PASID. Signed-off-by: Thomas Gleixner <[email protected]> Reviewed-by: Kevin Tian <[email protected]> Acked-by: Marc Zyngier <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2022-12-05iommu/vt-d: Switch to MSI parent domainsThomas Gleixner2-16/+12
Remove the global PCI/MSI irqdomain implementation and provide the required MSI parent ops so the PCI/MSI code can detect the new parent and setup per device domains. Signed-off-by: Thomas Gleixner <[email protected]> Reviewed-by: Kevin Tian <[email protected]> Acked-by: Marc Zyngier <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2022-12-05x86/apic/vector: Provide MSI parent domainThomas Gleixner1-1/+1
Enable MSI parent domain support in the x86 vector domain and fixup the checks in the iommu implementations to check whether device::msi::domain is the default MSI parent domain. That keeps the existing logic to protect e.g. devices behind VMD working. The interrupt remap PCI/MSI code still works because the underlying vector domain still provides the same functionality. None of the other x86 PCI/MSI, e.g. XEN and HyperV, implementations are affected either. They still work the same way both at the low level and the PCI/MSI implementations they provide. Signed-off-by: Thomas Gleixner <[email protected]> Reviewed-by: Kevin Tian <[email protected]> Acked-by: Marc Zyngier <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2022-12-05iommu/vt-d: Fix buggy QAT device maskJacob Pan1-1/+1
Impacted QAT device IDs that need extra dtlb flush quirk is ranging from 0x4940 to 0x4943. After bitwise AND device ID with 0xfffc the result should be 0x4940 instead of 0x494c to identify these devices. Fixes: e65a6897be5e ("iommu/vt-d: Add a fix for devices need extra dtlb flush") Reported-by: Raghunathan Srinivasan <[email protected]> Signed-off-by: Ashok Raj <[email protected]> Signed-off-by: Jacob Pan <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Lu Baolu <[email protected]> Signed-off-by: Joerg Roedel <[email protected]>
2022-12-02Merge tag 'v6.1-rc7' into iommufd.git for-nextJason Gunthorpe2-7/+6
Resolve conflicts in drivers/vfio/vfio_main.c by using the iommfd version. The rc fix was done a different way when iommufd patches reworked this code. Signed-off-by: Jason Gunthorpe <[email protected]>
2022-12-02iommu/vt-d: Fix PCI device refcount leak in dmar_dev_scope_init()Xiongfeng Wang1-0/+1
for_each_pci_dev() is implemented by pci_get_device(). The comment of pci_get_device() says that it will increase the reference count for the returned pci_dev and also decrease the reference count for the input pci_dev @from if it is not NULL. If we break for_each_pci_dev() loop with pdev not NULL, we need to call pci_dev_put() to decrease the reference count. Add the missing pci_dev_put() for the error path to avoid reference count leak. Fixes: 2e4552893038 ("iommu/vt-d: Unify the way to process DMAR device scope array") Signed-off-by: Xiongfeng Wang <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Lu Baolu <[email protected]> Signed-off-by: Joerg Roedel <[email protected]>
2022-12-02iommu/vt-d: Fix PCI device refcount leak in has_external_pci()Xiongfeng Wang1-1/+3
for_each_pci_dev() is implemented by pci_get_device(). The comment of pci_get_device() says that it will increase the reference count for the returned pci_dev and also decrease the reference count for the input pci_dev @from if it is not NULL. If we break for_each_pci_dev() loop with pdev not NULL, we need to call pci_dev_put() to decrease the reference count. Add the missing pci_dev_put() before 'return true' to avoid reference count leak. Fixes: 89a6079df791 ("iommu/vt-d: Force IOMMU on for platform opt in hint") Signed-off-by: Xiongfeng Wang <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Lu Baolu <[email protected]> Signed-off-by: Joerg Roedel <[email protected]>
2022-12-02iommu/vt-d: Fix PCI device refcount leak in prq_event_thread()Yang Yingliang1-5/+9
As comment of pci_get_domain_bus_and_slot() says, it returns a pci device with refcount increment, when finish using it, the caller must decrease the reference count by calling pci_dev_put(). So call pci_dev_put() after using the 'pdev' to avoid refcount leak. Besides, if the 'pdev' is null or intel_svm_prq_report() returns error, there is no need to trace this fault. Fixes: 06f4b8d09dba ("iommu/vt-d: Remove unnecessary SVA data accesses in page fault path") Suggested-by: Lu Baolu <[email protected]> Signed-off-by: Yang Yingliang <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Lu Baolu <[email protected]> Signed-off-by: Joerg Roedel <[email protected]>
2022-12-02iommu/vt-d: Add a fix for devices need extra dtlb flushJacob Pan3-3/+75
QAT devices on Intel Sapphire Rapids and Emerald Rapids have a defect in address translation service (ATS). These devices may inadvertently issue ATS invalidation completion before posted writes initiated with translated address that utilized translations matching the invalidation address range, violating the invalidation completion ordering. This patch adds an extra device TLB invalidation for the affected devices, it is needed to ensure no more posted writes with translated address following the invalidation completion. Therefore, the ordering is preserved and data-corruption is prevented. Device TLBs are invalidated under the following six conditions: 1. Device driver does DMA API unmap IOVA 2. Device driver unbind a PASID from a process, sva_unbind_device() 3. PASID is torn down, after PASID cache is flushed. e.g. process exit_mmap() due to crash 4. Under SVA usage, called by mmu_notifier.invalidate_range() where VM has to free pages that were unmapped 5. userspace driver unmaps a DMA buffer 6. Cache invalidation in vSVA usage (upcoming) For #1 and #2, device drivers are responsible for stopping DMA traffic before unmap/unbind. For #3, iommu driver gets mmu_notifier to invalidate TLB the same way as normal user unmap which will do an extra invalidation. The dTLB invalidation after PASID cache flush does not need an extra invalidation. Therefore, we only need to deal with #4 and #5 in this patch. #1 is also covered by this patch due to common code path with #5. Tested-by: Yuzhang Luo <[email protected]> Reviewed-by: Ashok Raj <[email protected]> Reviewed-by: Kevin Tian <[email protected]> Signed-off-by: Jacob Pan <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Lu Baolu <[email protected]> Signed-off-by: Joerg Roedel <[email protected]>
2022-11-29iommu: Add IOMMU_CAP_ENFORCE_CACHE_COHERENCYJason Gunthorpe1-5/+11
This queries if a domain linked to a device should expect to support enforce_cache_coherency() so iommufd can negotiate the rules for when a domain should be shared or not. For iommufd a device that declares IOMMU_CAP_ENFORCE_CACHE_COHERENCY will not be attached to a domain that does not support it. Link: https://lore.kernel.org/r/[email protected] Reviewed-by: Lu Baolu <[email protected]> Reviewed-by: Kevin Tian <[email protected]> Tested-by: Nicolin Chen <[email protected]> Tested-by: Yi Liu <[email protected]> Tested-by: Lixiao Yang <[email protected]> Tested-by: Matthew Rosato <[email protected]> Tested-by: Yu He <[email protected]> Signed-off-by: Jason Gunthorpe <[email protected]>
2022-11-22iommu/vt-d: Use real field for indication of first levelLu Baolu2-25/+15
The dmar_domain uses bit field members to indicate the behaviors. Add a bit field for using first level and remove the flags member to avoid duplication. Signed-off-by: Lu Baolu <[email protected]> Reviewed-by: Kevin Tian <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Joerg Roedel <[email protected]>
2022-11-22iommu/vt-d: Remove unnecessary domain_context_mapped()Lu Baolu1-44/+3
The device_domain_info::domain accurately records the domain attached to the device. It is unnecessary to check whether the context is present in the attach_dev path. Remove it to make the code neat. Signed-off-by: Lu Baolu <[email protected]> Reviewed-by: Kevin Tian <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Joerg Roedel <[email protected]>