aboutsummaryrefslogtreecommitdiff
path: root/arch/x86/kernel/apic/vector.c
AgeCommit message (Collapse)AuthorFilesLines
2020-12-14Merge tag 'x86-apic-2020-12-14' of ↵Linus Torvalds1-0/+49
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 apic updates from Thomas Gleixner: "Yet another large set of x86 interrupt management updates: - Simplification and distangling of the MSI related functionality - Let IO/APIC construct the RTE entries from an MSI message instead of having IO/APIC specific code in the interrupt remapping drivers - Make the retrieval of the parent interrupt domain (vector or remap unit) less hardcoded and use the relevant irqdomain callbacks for selection. - Allow the handling of more than 255 CPUs without a virtualized IOMMU when the hypervisor supports it. This has made been possible by the above modifications and also simplifies the existing workaround in the HyperV specific virtual IOMMU. - Cleanup of the historical timer_works() irq flags related inconsistencies" * tag 'x86-apic-2020-12-14' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (42 commits) x86/ioapic: Cleanup the timer_works() irqflags mess iommu/hyper-v: Remove I/O-APIC ID check from hyperv_irq_remapping_select() iommu/amd: Fix IOMMU interrupt generation in X2APIC mode iommu/amd: Don't register interrupt remapping irqdomain when IR is disabled iommu/amd: Fix union of bitfields in intcapxt support x86/ioapic: Correct the PCI/ISA trigger type selection x86/ioapic: Use I/O-APIC ID for finding irqdomain, not index x86/hyperv: Enable 15-bit APIC ID if the hypervisor supports it x86/kvm: Enable 15-bit extension when KVM_FEATURE_MSI_EXT_DEST_ID detected iommu/hyper-v: Disable IRQ pseudo-remapping if 15 bit APIC IDs are available x86/apic: Support 15 bits of APIC ID in MSI where available x86/ioapic: Handle Extended Destination ID field in RTE iommu/vt-d: Simplify intel_irq_remapping_select() x86: Kill all traces of irq_remapping_get_irq_domain() x86/ioapic: Use irq_find_matching_fwspec() to find remapping irqdomain x86/hpet: Use irq_find_matching_fwspec() to find remapping irqdomain iommu/hyper-v: Implement select() method on remapping irqdomain iommu/vt-d: Implement select() method on remapping irqdomain iommu/amd: Implement select() method on remapping irqdomain x86/apic: Add select() method on vector irqdomain ...
2020-12-10x86/apic/vector: Fix ordering in vector assignmentThomas Gleixner1-10/+14
Prarit reported that depending on the affinity setting the ' irq $N: Affinity broken due to vector space exhaustion.' message is showing up in dmesg, but the vector space on the CPUs in the affinity mask is definitely not exhausted. Shung-Hsi provided traces and analysis which pinpoints the problem: The ordering of trying to assign an interrupt vector in assign_irq_vector_any_locked() is simply wrong if the interrupt data has a valid node assigned. It does: 1) Try the intersection of affinity mask and node mask 2) Try the node mask 3) Try the full affinity mask 4) Try the full online mask Obviously #2 and #3 are in the wrong order as the requested affinity mask has to take precedence. In the observed cases #1 failed because the affinity mask did not contain CPUs from node 0. That made it allocate a vector from node 0, thereby breaking affinity and emitting the misleading message. Revert the order of #2 and #3 so the full affinity mask without the node intersection is tried before actually affinity is broken. If no node is assigned then only the full affinity mask and if that fails the full online mask is tried. Fixes: d6ffc6ac83b1 ("x86/vector: Respect affinity mask in irq descriptor") Reported-by: Prarit Bhargava <[email protected]> Reported-by: Shung-Hsi Yu <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Tested-by: Shung-Hsi Yu <[email protected]> Cc: [email protected] Link: https://lore.kernel.org/r/[email protected]
2020-10-28x86/apic: Add select() method on vector irqdomainDavid Woodhouse1-0/+43
This will be used to select the irqdomain for I/O-APIC and HPET. Signed-off-by: David Woodhouse <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2020-10-28x86/apic: Always provide irq_compose_msi_msg() method for vector domainDavid Woodhouse1-0/+6
This shouldn't be dependent on PCI_MSI. HPET and I/O-APIC can deliver interrupts through MSI without having any PCI in the system at all. Signed-off-by: David Woodhouse <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2020-09-16x86/irq: Initialize PCI/MSI domain at PCI init timeThomas Gleixner1-2/+0
No point in initializing the default PCI/MSI interrupt domain early and no point to create it when XEN PV/HVM/DOM0 are active. Move the initialization to pci_arch_init() and convert it to init ops so that XEN can override it as XEN has it's own PCI/MSI management. The XEN override comes in a later step. Signed-off-by: Thomas Gleixner <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2020-09-16x86/msi: Move compose message callback where it belongsThomas Gleixner1-0/+1
Composing the MSI message at the MSI chip level is wrong because the underlying parent domain is the one which knows how the message should be composed for the direct vector delivery or the interrupt remapping table entry. The interrupt remapping aware PCI/MSI chip does that already. Make the direct delivery chip do the same and move the composition of the direct delivery MSI message to the vector domain irq chip. This prepares for the upcoming device MSI support to avoid having architecture specific knowledge in the device MSI domain irq chips. Signed-off-by: Thomas Gleixner <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2020-08-27x86/irq: Unbreak interrupt affinity settingThomas Gleixner1-7/+9
Several people reported that 5.8 broke the interrupt affinity setting mechanism. The consolidation of the entry code reused the regular exception entry code for device interrupts and changed the way how the vector number is conveyed from ptregs->orig_ax to a function argument. The low level entry uses the hardware error code slot to push the vector number onto the stack which is retrieved from there into a function argument and the slot on stack is set to -1. The reason for setting it to -1 is that the error code slot is at the position where pt_regs::orig_ax is. A positive value in pt_regs::orig_ax indicates that the entry came via a syscall. If it's not set to a negative value then a signal delivery on return to userspace would try to restart a syscall. But there are other places which rely on pt_regs::orig_ax being a valid indicator for syscall entry. But setting pt_regs::orig_ax to -1 has a nasty side effect vs. the interrupt affinity setting mechanism, which was overlooked when this change was made. Moving interrupts on x86 happens in several steps. A new vector on a different CPU is allocated and the relevant interrupt source is reprogrammed to that. But that's racy and there might be an interrupt already in flight to the old vector. So the old vector is preserved until the first interrupt arrives on the new vector and the new target CPU. Once that happens the old vector is cleaned up, but this cleanup still depends on the vector number being stored in pt_regs::orig_ax, which is now -1. That -1 makes the check for cleanup: pt_regs::orig_ax == new_vector always false. As a consequence the interrupt is moved once, but then it cannot be moved anymore because the cleanup of the old vector never happens. There would be several ways to convey the vector information to that place in the guts of the interrupt handling, but on deeper inspection it turned out that this check is pointless and a leftover from the old affinity model of X86 which supported multi-CPU affinities. Under this model it was possible that an interrupt had an old and a new vector on the same CPU, so the vector match was required. Under the new model the effective affinity of an interrupt is always a single CPU from the requested affinity mask. If the affinity mask changes then either the interrupt stays on the CPU and on the same vector when that CPU is still in the new affinity mask or it is moved to a different CPU, but it is never moved to a different vector on the same CPU. Ergo the cleanup check for the matching vector number is not required and can be removed which makes the dependency on pt_regs:orig_ax go away. The remaining check for new_cpu == smp_processsor_id() is completely sufficient. If it matches then the interrupt was successfully migrated and the cleanup can proceed. For paranoia sake add a warning into the vector assignment code to validate that the assumption of never moving to a different vector on the same CPU holds. Fixes: 633260fa143 ("x86/irq: Convey vector as argument and not in ptregs") Reported-by: Alex bykov <[email protected]> Reported-by: Avi Kivity <[email protected]> Reported-by: Alexander Graf <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Tested-by: Alexander Graf <[email protected]> Cc: [email protected] Link: https://lore.kernel.org/r/[email protected]
2020-07-27genirq/affinity: Make affinity setting if activated opt-inThomas Gleixner1-0/+4
John reported that on a RK3288 system the perf per CPU interrupts are all affine to CPU0 and provided the analysis: "It looks like what happens is that because the interrupts are not per-CPU in the hardware, armpmu_request_irq() calls irq_force_affinity() while the interrupt is deactivated and then request_irq() with IRQF_PERCPU | IRQF_NOBALANCING. Now when irq_startup() runs with IRQ_STARTUP_NORMAL, it calls irq_setup_affinity() which returns early because IRQF_PERCPU and IRQF_NOBALANCING are set, leaving the interrupt on its original CPU." This was broken by the recent commit which blocked interrupt affinity setting in hardware before activation of the interrupt. While this works in general, it does not work for this particular case. As contrary to the initial analysis not all interrupt chip drivers implement an activate callback, the safe cure is to make the deferred interrupt affinity setting at activation time opt-in. Implement the necessary core logic and make the two irqchip implementations for which this is required opt-in. In hindsight this would have been the right thing to do, but ... Fixes: baedb87d1b53 ("genirq/affinity: Handle affinity setting on inactive interrupts correctly") Reported-by: John Keeping <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Tested-by: Marc Zyngier <[email protected]> Acked-by: Marc Zyngier <[email protected]> Cc: [email protected] Link: https://lkml.kernel.org/r/[email protected]
2020-07-17genirq/affinity: Handle affinity setting on inactive interrupts correctlyThomas Gleixner1-17/+5
Setting interrupt affinity on inactive interrupts is inconsistent when hierarchical irq domains are enabled. The core code should just store the affinity and not call into the irq chip driver for inactive interrupts because the chip drivers may not be in a state to handle such requests. X86 has a hacky workaround for that but all other irq chips have not which causes problems e.g. on GIC V3 ITS. Instead of adding more ugly hacks all over the place, solve the problem in the core code. If the affinity is set on an inactive interrupt then: - Store it in the irq descriptors affinity mask - Update the effective affinity to reflect that so user space has a consistent view - Don't call into the irq chip driver This is the core equivalent of the X86 workaround and works correctly because the affinity setting is established in the irq chip when the interrupt is activated later on. Note, that this is only effective when hierarchical irq domains are enabled by the architecture. Doing it unconditionally would break legacy irq chip implementations. For hierarchial irq domains this works correctly as none of the drivers can have a dependency on affinity setting in inactive state by design. Remove the X86 workaround as it is not longer required. Fixes: 02edee152d6e ("x86/apic/vector: Ignore set_affinity call for inactive interrupts") Reported-by: Ali Saidi <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Tested-by: Ali Saidi <[email protected]> Cc: [email protected] Link: https://lore.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected]
2020-07-14irqdomain/treewide: Keep firmware node unconditionally allocatedThomas Gleixner1-1/+0
Quite some non OF/ACPI users of irqdomains allocate firmware nodes of type IRQCHIP_FWNODE_NAMED or IRQCHIP_FWNODE_NAMED_ID and free them right after creating the irqdomain. The only purpose of these FW nodes is to convey name information. When this was introduced the core code did not store the pointer to the node in the irqdomain. A recent change stored the firmware node pointer in irqdomain for other reasons and missed to notice that the usage sites which do the alloc_fwnode/create_domain/free_fwnode sequence are broken by this. Storing a dangling pointer is dangerous itself, but in case that the domain is destroyed later on this leads to a double free. Remove the freeing of the firmware node after creating the irqdomain from all affected call sites to cure this. Fixes: 711419e504eb ("irqdomain: Add the missing assignment of domain->fwnode for named fwnode") Reported-by: Andy Shevchenko <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Acked-by: Bjorn Helgaas <[email protected]> Acked-by: Marc Zyngier <[email protected]> Cc: [email protected] Link: https://lkml.kernel.org/r/[email protected]
2020-06-11x86/entry: Convert SMP system vectors to IDTENTRY_SYSVECThomas Gleixner1-3/+2
Convert SMP system vectors to IDTENTRY_SYSVEC: - Implement the C entry point with DEFINE_IDTENTRY_SYSVEC - Emit the ASM stub with DECLARE_IDTENTRY_SYSVEC - Remove the ASM idtentries in 64-bit - Remove the BUILD_INTERRUPT entries in 32-bit - Remove the old prototypes No functional change. Signed-off-by: Thomas Gleixner <[email protected]> Signed-off-by: Ingo Molnar <[email protected]> Acked-by: Andy Lutomirski <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2020-03-30Merge tag 'irq-core-2020-03-30' of ↵Linus Torvalds1-0/+6
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull irq updates from Thomas Gleixner: "Updates for the interrupt subsystem: Treewide: - Cleanup of setup_irq() which is not longer required because the memory allocator is available early. Most cleanup changes come through the various maintainer trees, so the final removal of setup_irq() is postponed towards the end of the merge window. Core: - Protection against unsafe invocation of interrupt handlers and unsafe interrupt injection including a fixup of the offending PCI/AER error injection mechanism. Invoking interrupt handlers from arbitrary contexts, i.e. outside of an actual interrupt, can cause inconsistent state on the fragile x86 interrupt affinity changing hardware trainwreck. Drivers: - Second wave of support for the new ARM GICv4.1 - Multi-instance support for Xilinx and PLIC interrupt controllers - CPU-Hotplug support for PLIC - The obligatory new driver for X1000 TCU - Enhancements, cleanups and fixes all over the place" * tag 'irq-core-2020-03-30' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (58 commits) unicore32: Replace setup_irq() by request_irq() sh: Replace setup_irq() by request_irq() hexagon: Replace setup_irq() by request_irq() c6x: Replace setup_irq() by request_irq() alpha: Replace setup_irq() by request_irq() irqchip/gic-v4.1: Eagerly vmap vPEs irqchip/gic-v4.1: Add VSGI property setup irqchip/gic-v4.1: Add VSGI allocation/teardown irqchip/gic-v4.1: Move doorbell management to the GICv4 abstraction layer irqchip/gic-v4.1: Plumb set_vcpu_affinity SGI callbacks irqchip/gic-v4.1: Plumb get/set_irqchip_state SGI callbacks irqchip/gic-v4.1: Plumb mask/unmask SGI callbacks irqchip/gic-v4.1: Add initial SGI configuration irqchip/gic-v4.1: Plumb skeletal VSGI irqchip irqchip/stm32: Retrigger both in eoi and unmask callbacks irqchip/gic-v3: Move irq_domain_update_bus_token to after checking for NULL domain irqchip/xilinx: Do not call irq_set_default_host() irqchip/xilinx: Enable generic irq multi handler irqchip/xilinx: Fill error code when irq domain registration fails irqchip/xilinx: Add support for multiple instances ...
2020-03-13x86/vector: Remove warning on managed interrupt migrationPeter Xu1-6/+8
The vector management code assumes that managed interrupts cannot be migrated away from an online CPU. free_moved_vector() has a WARN_ON_ONCE() which triggers when a managed interrupt vector association on a online CPU is cleared. The CPU offline code uses a different mechanism which cannot trigger this. This assumption is not longer correct because the new CPU isolation feature which affects the placement of managed interrupts must be able to move a managed interrupt away from an online CPU. There are two reasons why this can happen: 1) When the interrupt is activated the affinity mask which was established in irq_create_affinity_masks() is handed in to the vector allocation code. This mask contains all CPUs to which the interrupt can be made affine to, but this does not take the CPU isolation 'managed_irq' mask into account. When the interrupt is finally requested by the device driver then the affinity is checked again and the CPU isolation 'managed_irq' mask is taken into account, which moves the interrupt to a non-isolated CPU if possible. 2) The interrupt can be affine to an isolated CPU because the non-isolated CPUs in the calculated affinity mask are not online. Once a non-isolated CPU which is in the mask comes online the interrupt is migrated to this non-isolated CPU In both cases the regular online migration mechanism is used which triggers the WARN_ON_ONCE() in free_moved_vector(). Case #1 could have been addressed by taking the isolation mask into account, but that would require a massive code change in the activation logic and the eventual migration event was accepted as a reasonable tradeoff when the isolation feature was developed. But even if #1 would be addressed, #2 would still trigger it. Of course the warning in free_moved_vector() was overlooked at that time and the above two cases which have been discussed during patch review have obviously never been tested before the final submission. So keep it simple and remove the warning. [ tglx: Rewrote changelog and added a comment to free_moved_vector() ] Fixes: 11ea68f553e2 ("genirq, sched/isolation: Isolate from handling managed interrupts") Signed-off-by: Peter Xu <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Reviewed-by: Ming Lei <[email protected]> Link: https://lkml.kernel.org/r/[email protected]
2020-03-08x86/apic/vector: Force interupt handler invocation to irq contextThomas Gleixner1-0/+6
Sathyanarayanan reported that the PCI-E AER error injection mechanism can result in a NULL pointer dereference in apic_ack_edge(): BUG: unable to handle kernel NULL pointer dereference at 0000000000000078 RIP: 0010:apic_ack_edge+0x1e/0x40 Call Trace: handle_edge_irq+0x7d/0x1e0 generic_handle_irq+0x27/0x30 aer_inject_write+0x53a/0x720 It crashes in irq_complete_move() which dereferences get_irq_regs() which is obviously NULL when this is called from non interrupt context. Of course the pointer could be checked, but that just papers over the real issue. Invoking the low level interrupt handling mechanism from random code can wreckage the fragile interrupt affinity mechanism of x86 as interrupts can only be moved in interrupt context or with special care when a CPU goes offline and the move has to be enforced. In the best case this triggers the warning in the MSI affinity setter, but if the call happens on the correct CPU it just corrupts state and might prevent further interrupt delivery for the affected device. Mark the APIC interrupts as unsuitable for being invoked in random contexts. This prevents the AER injection from proliferating the wreckage, but that's less broken than the current state of affairs and more correct than just papering over the problem by sprinkling random checks all over the place and silently corrupting state. Reported-by: [email protected] Signed-off-by: Thomas Gleixner <[email protected]> Link: https://lkml.kernel.org/r/[email protected]
2019-08-28x86/apic/vector: Warn when vector space exhaustion breaks affinityNeil Horman1-0/+11
On x86, CPUs are limited in the number of interrupts they can have affined to them as they only support 256 interrupt vectors per CPU. 32 vectors are reserved for the CPU and the kernel reserves another 22 for internal purposes. That leaves 202 vectors for assignement to devices. When an interrupt is set up or the affinity is changed by the kernel or the administrator, the vector assignment code attempts to honor the requested affinity mask. If the vector space on the CPUs in that affinity mask is exhausted the code falls back to a wider set of CPUs and assigns a vector on a CPU outside of the requested affinity mask silently. While the effective affinity is reflected in the corresponding /proc/irq/$N/effective_affinity* files the silent breakage of the requested affinity can lead to unexpected behaviour for administrators. Add a pr_warn() when this happens so that adminstrators get at least informed about it in the syslog. [ tglx: Massaged changelog and made the pr_warn() more informative ] Reported-by: [email protected] Signed-off-by: Neil Horman <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Tested-by: [email protected] Link: https://lkml.kernel.org/r/[email protected]
2019-07-08Merge branch 'x86-apic-for-linus' of ↵Linus Torvalds1-2/+2
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x96 apic updates from Thomas Gleixner: "Updates for the x86 APIC interrupt handling and APIC timer: - Fix a long standing issue with spurious interrupts which was caused by the big vector management rework a few years ago. Robert Hodaszi provided finally enough debug data and an excellent initial failure analysis which allowed to understand the underlying issues. This contains a change to the core interrupt management code which is required to handle this correctly for the APIC/IO_APIC. The core changes are NOOPs for most architectures except ARM64. ARM64 is not impacted by the change as confirmed by Marc Zyngier. - Newer systems allow to disable the PIT clock for power saving causing panic in the timer interrupt delivery check of the IO/APIC when the HPET timer is not enabled either. While the clock could be turned on this would cause an endless whack a mole game to chase the proper register in each affected chipset. These systems provide the relevant frequencies for TSC, CPU and the local APIC timer via CPUID and/or MSRs, which allows to avoid the PIT/HPET based calibration. As the calibration code is the only usage of the legacy timers on modern systems and is skipped anyway when the frequencies are known already, there is no point in setting up the PIT and actually checking for the interrupt delivery via IO/APIC. To achieve this on a wide variety of platforms, the CPUID/MSR based frequency readout has been made more robust, which also allowed to remove quite some workarounds which turned out to be not longer required. Thanks to Daniel Drake for analysis, patches and verification" * 'x86-apic-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/irq: Seperate unused system vectors from spurious entry again x86/irq: Handle spurious interrupt after shutdown gracefully x86/ioapic: Implement irq_get_irqchip_state() callback genirq: Add optional hardware synchronization for shutdown genirq: Fix misleading synchronize_irq() documentation genirq: Delay deactivation in free_irq() x86/timer: Skip PIT initialization on modern chipsets x86/apic: Use non-atomic operations when possible x86/apic: Make apic_bsp_setup() static x86/tsc: Set LAPIC timer period to crystal clock frequency x86/apic: Rename 'lapic_timer_frequency' to 'lapic_timer_period' x86/tsc: Use CPUID.0x16 to calculate missing crystal frequency
2019-07-03x86/irq: Handle spurious interrupt after shutdown gracefullyThomas Gleixner1-2/+2
Since the rework of the vector management, warnings about spurious interrupts have been reported. Robert provided some more information and did an initial analysis. The following situation leads to these warnings: CPU 0 CPU 1 IO_APIC interrupt is raised sent to CPU1 Unable to handle immediately (interrupts off, deep idle delay) mask() ... free() shutdown() synchronize_irq() clear_vector() do_IRQ() -> vector is clear Before the rework the vector entries of legacy interrupts were statically assigned and occupied precious vector space while most of them were unused. Due to that the above situation was handled silently because the vector was handled and the core handler of the assigned interrupt descriptor noticed that it is shut down and returned. While this has been usually observed with legacy interrupts, this situation is not limited to them. Any other interrupt source, e.g. MSI, can cause the same issue. After adding proper synchronization for level triggered interrupts, this can only happen for edge triggered interrupts where the IO-APIC obviously cannot provide information about interrupts in flight. While the spurious warning is actually harmless in this case it worries users and driver developers. Handle it gracefully by marking the vector entry as VECTOR_SHUTDOWN instead of VECTOR_UNUSED when the vector is freed up. If that above late handling happens the spurious detector will not complain and switch the entry to VECTOR_UNUSED. Any subsequent spurious interrupt on that line will trigger the spurious warning as before. Fixes: 464d12309e1b ("x86/vector: Switch IOAPIC to global reservation mode") Reported-by: Robert Hodaszi <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]>- Tested-by: Robert Hodaszi <[email protected]> Cc: Marc Zyngier <[email protected]> Link: https://lkml.kernel.org/r/[email protected]
2019-06-19treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 500Thomas Gleixner1-4/+1
Based on 2 normalized pattern(s): this program is free software you can redistribute it and or modify it under the terms of the gnu general public license version 2 as published by the free software foundation this program is free software you can redistribute it and or modify it under the terms of the gnu general public license version 2 as published by the free software foundation # extracted by the scancode license scanner the SPDX license identifier GPL-2.0-only has been chosen to replace the boilerplate/reference in 4122 file(s). Signed-off-by: Thomas Gleixner <[email protected]> Reviewed-by: Enrico Weigelt <[email protected]> Reviewed-by: Kate Stewart <[email protected]> Reviewed-by: Allison Randal <[email protected]> Cc: [email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Greg Kroah-Hartman <[email protected]>
2018-12-08x86/kernel: Fix more -Wmissing-prototypes warningsBorislav Petkov1-0/+1
... with the goal of eventually enabling -Wmissing-prototypes by default. At least on x86. Make functions static where possible, otherwise add prototypes or make them visible through includes. asm/trace/ changes courtesy of Steven Rostedt <[email protected]>. Signed-off-by: Borislav Petkov <[email protected]> Reviewed-by: Masami Hiramatsu <[email protected]> Reviewed-by: Ingo Molnar <[email protected]> Acked-by: Rafael J. Wysocki <[email protected]> # ACPI + cpufreq bits Cc: Andrew Banman <[email protected]> Cc: Dimitri Sivanich <[email protected]> Cc: "H. Peter Anvin" <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Masami Hiramatsu <[email protected]> Cc: Mike Travis <[email protected]> Cc: "Steven Rostedt (VMware)" <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Yi Wang <[email protected]> Cc: [email protected]
2018-09-18irq/matrix: Spread managed interrupts on allocationDou Liyang1-5/+4
Linux spreads out the non managed interrupt across the possible target CPUs to avoid vector space exhaustion. Managed interrupts are treated differently, as for them the vectors are reserved (with guarantee) when the interrupt descriptors are initialized. When the interrupt is requested a real vector is assigned. The assignment logic uses the first CPU in the affinity mask for assignment. If the interrupt has more than one CPU in the affinity mask, which happens when a multi queue device has less queues than CPUs, then doing the same search as for non managed interrupts makes sense as it puts the interrupt on the least interrupt plagued CPU. For single CPU affine vectors that's obviously a NOOP. Restructre the matrix allocation code so it does the 'best CPU' search, add the sanity check for an empty affinity mask and adapt the call site in the x86 vector management code. [ tglx: Added the empty mask check to the core and improved change log ] Signed-off-by: Dou Liyang <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Cc: [email protected] Link: https://lkml.kernel.org/r/[email protected]
2018-09-08x86/apic/vector: Make error return value negativeThomas Gleixner1-1/+1
activate_managed() returns EINVAL instead of -EINVAL in case of error. While this is unlikely to happen, the positive return value would cause further malfunction at the call site. Fixes: 2db1f959d9dc ("x86/vector: Handle managed interrupts proper") Signed-off-by: Thomas Gleixner <[email protected]> Cc: [email protected]
2018-08-14Merge branch 'l1tf-final' of ↵Linus Torvalds1-0/+1
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Merge L1 Terminal Fault fixes from Thomas Gleixner: "L1TF, aka L1 Terminal Fault, is yet another speculative hardware engineering trainwreck. It's a hardware vulnerability which allows unprivileged speculative access to data which is available in the Level 1 Data Cache when the page table entry controlling the virtual address, which is used for the access, has the Present bit cleared or other reserved bits set. If an instruction accesses a virtual address for which the relevant page table entry (PTE) has the Present bit cleared or other reserved bits set, then speculative execution ignores the invalid PTE and loads the referenced data if it is present in the Level 1 Data Cache, as if the page referenced by the address bits in the PTE was still present and accessible. While this is a purely speculative mechanism and the instruction will raise a page fault when it is retired eventually, the pure act of loading the data and making it available to other speculative instructions opens up the opportunity for side channel attacks to unprivileged malicious code, similar to the Meltdown attack. While Meltdown breaks the user space to kernel space protection, L1TF allows to attack any physical memory address in the system and the attack works across all protection domains. It allows an attack of SGX and also works from inside virtual machines because the speculation bypasses the extended page table (EPT) protection mechanism. The assoicated CVEs are: CVE-2018-3615, CVE-2018-3620, CVE-2018-3646 The mitigations provided by this pull request include: - Host side protection by inverting the upper address bits of a non present page table entry so the entry points to uncacheable memory. - Hypervisor protection by flushing L1 Data Cache on VMENTER. - SMT (HyperThreading) control knobs, which allow to 'turn off' SMT by offlining the sibling CPU threads. The knobs are available on the kernel command line and at runtime via sysfs - Control knobs for the hypervisor mitigation, related to L1D flush and SMT control. The knobs are available on the kernel command line and at runtime via sysfs - Extensive documentation about L1TF including various degrees of mitigations. Thanks to all people who have contributed to this in various ways - patches, review, testing, backporting - and the fruitful, sometimes heated, but at the end constructive discussions. There is work in progress to provide other forms of mitigations, which might be less horrible performance wise for a particular kind of workloads, but this is not yet ready for consumption due to their complexity and limitations" * 'l1tf-final' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (75 commits) x86/microcode: Allow late microcode loading with SMT disabled tools headers: Synchronise x86 cpufeatures.h for L1TF additions x86/mm/kmmio: Make the tracer robust against L1TF x86/mm/pat: Make set_memory_np() L1TF safe x86/speculation/l1tf: Make pmd/pud_mknotpresent() invert x86/speculation/l1tf: Invert all not present mappings cpu/hotplug: Fix SMT supported evaluation KVM: VMX: Tell the nested hypervisor to skip L1D flush on vmentry x86/speculation: Use ARCH_CAPABILITIES to skip L1D flush on vmentry x86/speculation: Simplify sysfs report of VMX L1TF vulnerability Documentation/l1tf: Remove Yonah processors from not vulnerable list x86/KVM/VMX: Don't set l1tf_flush_l1d from vmx_handle_external_intr() x86/irq: Let interrupt handlers set kvm_cpu_l1tf_flush_l1d x86: Don't include linux/irq.h from asm/hardirq.h x86/KVM/VMX: Introduce per-host-cpu analogue of l1tf_flush_l1d x86/irq: Demote irq_cpustat_t::__softirq_pending to u16 x86/KVM/VMX: Move the l1tf_flush_l1d test to vmx_l1d_flush() x86/KVM/VMX: Replace 'vmx_l1d_flush_always' with 'vmx_l1d_flush_cond' x86/KVM/VMX: Don't set l1tf_flush_l1d to true from vmx_l1d_flush() cpu/hotplug: detect SMT disabled by BIOS ...
2018-08-13Merge branch 'x86-apic-for-linus' of ↵Linus Torvalds1-14/+5
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 apic update from Thomas Gleixner: "Trivial cleanups of the APIC related code" * 'x86-apic-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/apic: Trivial coding style fixes x86/vector: Merge allocate_vector() into assign_vector_locked()
2018-08-05x86: Don't include linux/irq.h from asm/hardirq.hNicolai Stange1-0/+1
The next patch in this series will have to make the definition of irq_cpustat_t available to entering_irq(). Inclusion of asm/hardirq.h into asm/apic.h would cause circular header dependencies like asm/smp.h asm/apic.h asm/hardirq.h linux/irq.h linux/topology.h linux/smp.h asm/smp.h or linux/gfp.h linux/mmzone.h asm/mmzone.h asm/mmzone_64.h asm/smp.h asm/apic.h asm/hardirq.h linux/irq.h linux/irqdesc.h linux/kobject.h linux/sysfs.h linux/kernfs.h linux/idr.h linux/gfp.h and others. This causes compilation errors because of the header guards becoming effective in the second inclusion: symbols/macros that had been defined before wouldn't be available to intermediate headers in the #include chain anymore. A possible workaround would be to move the definition of irq_cpustat_t into its own header and include that from both, asm/hardirq.h and asm/apic.h. However, this wouldn't solve the real problem, namely asm/harirq.h unnecessarily pulling in all the linux/irq.h cruft: nothing in asm/hardirq.h itself requires it. Also, note that there are some other archs, like e.g. arm64, which don't have that #include in their asm/hardirq.h. Remove the linux/irq.h #include from x86' asm/hardirq.h. Fix resulting compilation errors by adding appropriate #includes to *.c files as needed. Note that some of these *.c files could be cleaned up a bit wrt. to their set of #includes, but that should better be done from separate patches, if at all. Signed-off-by: Nicolai Stange <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]>
2018-07-30x86/apic: Trivial coding style fixesYi Wang1-1/+1
There is inconsistent indenting in calibrate_APIC_clock() and activate_managed(). Remove the surplus TAB. Signed-off-by: Yi Wang <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Reviewed-by: Jiang Biao <[email protected]> Acked-by: Steven Rostedt (VMware) <[email protected]> Cc: [email protected] Cc: [email protected] Cc: [email protected] Cc: [email protected] Cc: [email protected] Cc: [email protected] Cc: [email protected] Link: https://lkml.kernel.org/r/[email protected]
2018-06-06x86/apic/vector: Print APIC control bits in debugfsThomas Gleixner1-13/+14
Extend the debugability of the vector management by adding the state bits to the debugfs output. Signed-off-by: Thomas Gleixner <[email protected]> Tested-by: Song Liu <[email protected]> Cc: Joerg Roedel <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Song Liu <[email protected]> Cc: Dmitry Safonov <[email protected]> Cc: Mike Travis <[email protected]> Cc: Borislav Petkov <[email protected]> Cc: Tariq Toukan <[email protected]> Link: https://lkml.kernel.org/r/[email protected]
2018-06-06x86/apic: Provide apic_ack_irq()Thomas Gleixner1-2/+7
apic_ack_edge() is explicitely for handling interrupt affinity cleanup when interrupt remapping is not available or disable. Remapped interrupts and also some of the platform specific special interrupts, e.g. UV, invoke ack_APIC_irq() directly. To address the issue of failing an affinity update with -EBUSY the delayed affinity mechanism can be reused, but ack_APIC_irq() does not handle that. Adding this to ack_APIC_irq() is not possible, because that function is also used for exceptions and directly handled interrupts like IPIs. Create a new function, which just contains the conditional invocation of irq_move_irq() and the final ack_APIC_irq(). Reuse the new function in apic_ack_edge(). Preparatory change for the real fix. Fixes: dccfe3147b42 ("x86/vector: Simplify vector move cleanup") Signed-off-by: Thomas Gleixner <[email protected]> Tested-by: Song Liu <[email protected]> Cc: Joerg Roedel <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Song Liu <[email protected]> Cc: Dmitry Safonov <[email protected]> Cc: [email protected] Cc: Mike Travis <[email protected]> Cc: Borislav Petkov <[email protected]> Cc: Tariq Toukan <[email protected]> Link: https://lkml.kernel.org/r/[email protected]
2018-06-06x86/apic/vector: Prevent hlist corruption and leaksThomas Gleixner1-0/+9
Several people observed the WARN_ON() in irq_matrix_free() which triggers when the caller tries to free an vector which is not in the allocation range. Song provided the trace information which allowed to decode the root cause. The rework of the vector allocation mechanism failed to preserve a sanity check, which prevents setting a new target vector/CPU when the previous affinity change has not fully completed. As a result a half finished affinity change can be overwritten, which can cause the leak of a irq descriptor pointer on the previous target CPU and double enqueue of the hlist head into the cleanup lists of two or more CPUs. After one CPU cleaned up its vector the next CPU will invoke the cleanup handler with vector 0, which triggers the out of range warning in the matrix allocator. Prevent this by checking the apic_data of the interrupt whether the move_in_progress flag is false and the hlist node is not hashed. Return -EBUSY if not. This prevents the damage and restores the behaviour before the vector allocation rework, but due to other changes in that area it also widens the chance that user space can observe -EBUSY. In theory this should be fine, but actually not all user space tools handle -EBUSY correctly. Addressing that is not part of this fix, but will be addressed in follow up patches. Fixes: 69cde0004a4b ("x86/vector: Use matrix allocator for vector assignment") Reported-by: Dmitry Safonov <[email protected]> Reported-by: Tariq Toukan <[email protected]> Reported-by: Song Liu <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Tested-by: Song Liu <[email protected]> Cc: Joerg Roedel <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: [email protected] Cc: Mike Travis <[email protected]> Cc: Borislav Petkov <[email protected]> Link: https://lkml.kernel.org/r/[email protected]
2018-05-19x86/vector: Merge allocate_vector() into assign_vector_locked()Dou Liyang1-13/+4
assign_vector_locked() calls allocate_vector() to get a real vector for an IRQ. If the current target CPU is online and in the new requested affinity mask, allocate_vector() will return 0 and nothing should be done. But, assign_vector_locked() calls apic_update_irq_cfg() even in that case which is pointless. allocate_vector() is not called from anything else, so the functions can be merged and in case of no change the apic_update_irq_cfg() can be avoided. [ tglx: Massaged changelog ] Signed-off-by: Dou Liyang <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Cc: [email protected] Link: https://lkml.kernel.org/r/[email protected]
2018-02-23x86/apic/vector: Handle vector release on CPU unplug correctlyThomas Gleixner1-3/+22
When a irq vector is replaced, then the previous vector is normally released when the first interrupt happens on the new vector. If the target CPU of the previous vector is already offline when the new vector is installed, then the previous vector is silently discarded, which leads to accounting issues causing suspend failures and other problems. Adjust the logic so that the previous vector is freed in the underlying matrix allocator to ensure that the accounting stays correct. Fixes: 69cde0004a4b ("x86/vector: Use matrix allocator for vector assignment") Reported-by: Yuriy Vostrikov <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Tested-by: Yuriy Vostrikov <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Randy Dunlap <[email protected]> Cc: [email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
2018-01-17x86/apic/vector: Fix off by one in error pathThomas Gleixner1-2/+5
Keith reported the following warning: WARNING: CPU: 28 PID: 1420 at kernel/irq/matrix.c:222 irq_matrix_remove_managed+0x10f/0x120 x86_vector_free_irqs+0xa1/0x180 x86_vector_alloc_irqs+0x1e4/0x3a0 msi_domain_alloc+0x62/0x130 The reason for this is that if the vector allocation fails the error handling code tries to free the failed vector as well, which causes the above imbalance warning to trigger. Adjust the error path to handle this correctly. Fixes: b5dc8e6c21e7 ("x86/irq: Use hierarchical irqdomain to manage CPU interrupt vectors") Reported-by: Keith Busch <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Tested-by: Keith Busch <[email protected]> Cc: [email protected] Link: https://lkml.kernel.org/r/alpine.DEB.2.20.1801161217300.1823@nanos
2017-12-29genirq/msi, x86/vector: Prevent reservation mode for non maskable MSIThomas Gleixner1-1/+11
The new reservation mode for interrupts assigns a dummy vector when the interrupt is allocated and assigns a real vector when the interrupt is requested. The reservation mode prevents vector pressure when devices with a large amount of queues/interrupts are initialized, but only a minimal subset of those queues/interrupts is actually used. This mode has an issue with MSI interrupts which cannot be masked. If the driver is not careful or the hardware emits an interrupt before the device irq is requestd by the driver then the interrupt ends up on the dummy vector as a spurious interrupt which can cause malfunction of the device or in the worst case a lockup of the machine. Change the logic for the reservation mode so that the early activation of MSI interrupts checks whether: - the device is a PCI/MSI device - the reservation mode of the underlying irqdomain is activated - PCI/MSI masking is globally enabled - the PCI/MSI device uses either MSI-X, which supports masking, or MSI with the maskbit supported. If one of those conditions is false, then clear the reservation mode flag in the irq data of the interrupt and invoke irq_domain_activate_irq() with the reserve argument cleared. In the x86 vector code, clear the can_reserve flag in the vector allocation data so a subsequent free_irq() won't create the same situation again. The interrupt stays assigned to a real vector until pci_disable_msi() is invoked and all allocations are undone. Fixes: 4900be83602b ("x86/vector/msi: Switch to global reservation mode") Reported-by: Alexandru Chirvasitu <[email protected]> Reported-by: Andy Shevchenko <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Tested-by: Alexandru Chirvasitu <[email protected]> Tested-by: Andy Shevchenko <[email protected]> Cc: Dou Liyang <[email protected]> Cc: Pavel Machek <[email protected]> Cc: Maciej W. Rozycki <[email protected]> Cc: Mikael Pettersson <[email protected]> Cc: Josh Poulson <[email protected]> Cc: Mihai Costache <[email protected]> Cc: Stephen Hemminger <[email protected]> Cc: Marc Zyngier <[email protected]> Cc: [email protected] Cc: Haiyang Zhang <[email protected]> Cc: Dexuan Cui <[email protected]> Cc: Simon Xiao <[email protected]> Cc: Saeed Mahameed <[email protected]> Cc: Jork Loeser <[email protected]> Cc: Bjorn Helgaas <[email protected]> Cc: [email protected] Cc: KY Srinivasan <[email protected]> Cc: Alan Cox <[email protected]> Cc: Sakari Ailus <[email protected]>, Cc: [email protected] Link: https://lkml.kernel.org/r/alpine.DEB.2.20.1712291406420.1899@nanos Link: https://lkml.kernel.org/r/alpine.DEB.2.20.1712291409460.1899@nanos
2017-12-29genirq/irqdomain: Rename early argument of irq_domain_activate_irq()Thomas Gleixner1-3/+3
The 'early' argument of irq_domain_activate_irq() is actually used to denote reservation mode. To avoid confusion, rename it before abuse happens. No functional change. Fixes: 72491643469a ("genirq/irqdomain: Update irq_domain_ops.activate() signature") Signed-off-by: Thomas Gleixner <[email protected]> Cc: Alexandru Chirvasitu <[email protected]> Cc: Andy Shevchenko <[email protected]> Cc: Dou Liyang <[email protected]> Cc: Pavel Machek <[email protected]> Cc: Maciej W. Rozycki <[email protected]> Cc: Mikael Pettersson <[email protected]> Cc: Josh Poulson <[email protected]> Cc: Mihai Costache <[email protected]> Cc: Stephen Hemminger <[email protected]> Cc: Marc Zyngier <[email protected]> Cc: [email protected] Cc: Haiyang Zhang <[email protected]> Cc: Dexuan Cui <[email protected]> Cc: Simon Xiao <[email protected]> Cc: Saeed Mahameed <[email protected]> Cc: Jork Loeser <[email protected]> Cc: Bjorn Helgaas <[email protected]> Cc: [email protected] Cc: KY Srinivasan <[email protected]> Cc: Alan Cox <[email protected]> Cc: Sakari Ailus <[email protected]>, Cc: [email protected]
2017-12-29x86/vector: Use IRQD_CAN_RESERVE flagThomas Gleixner1-0/+2
Set the new CAN_RESERVE flag when the initial reservation for an interrupt happens. The flag is used in a subsequent patch to disable reservation mode for a certain class of MSI devices. Signed-off-by: Thomas Gleixner <[email protected]> Tested-by: Alexandru Chirvasitu <[email protected]> Tested-by: Andy Shevchenko <[email protected]> Cc: Dou Liyang <[email protected]> Cc: Pavel Machek <[email protected]> Cc: Maciej W. Rozycki <[email protected]> Cc: Mikael Pettersson <[email protected]> Cc: Josh Poulson <[email protected]> Cc: Mihai Costache <[email protected]> Cc: Stephen Hemminger <[email protected]> Cc: Marc Zyngier <[email protected]> Cc: [email protected] Cc: Haiyang Zhang <[email protected]> Cc: Dexuan Cui <[email protected]> Cc: Simon Xiao <[email protected]> Cc: Saeed Mahameed <[email protected]> Cc: Jork Loeser <[email protected]> Cc: Bjorn Helgaas <[email protected]> Cc: [email protected] Cc: KY Srinivasan <[email protected]> Cc: Alan Cox <[email protected]> Cc: Sakari Ailus <[email protected]>, Cc: [email protected]
2017-12-06x86: Fix Sparse warnings about non-static functionsColin Ian King1-2/+2
Functions x86_vector_debug_show(), uv_handle_nmi() and uv_nmi_setup_common() are local to the source and do not need to be in global scope, so make them static. Fixes up various sparse warnings. Signed-off-by: Colin Ian King <[email protected]> Acked-by: Mike Travis <[email protected]> Cc: H. Peter Anvin <[email protected]> Cc: Jiri Kosina <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Russ Anderson <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: [email protected] Cc: [email protected] Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
2017-11-23x86/PCI: Remove unused HyperTransport interrupt supportBjorn Helgaas1-3/+2
There are no in-tree callers of ht_create_irq(), the driver interface for HyperTransport interrupts, left. Remove the unused entry point and all the supporting code. See 8b955b0dddb3 ("[PATCH] Initial generic hypertransport interrupt support"). Signed-off-by: Bjorn Helgaas <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Acked-by: "Eric W. Biederman" <[email protected]> Cc: Andi Kleen <[email protected]> Cc: Greg Kroah-Hartman <[email protected]> Cc: [email protected] Cc: Benjamin Herrenschmidt <[email protected]> Link: https://lkml.kernel.org/r/20171122221337.3877.23362.stgit@bhelgaas-glaptop.roam.corp.google.com
2017-10-17x86/vector: Use correct per cpu variable in free_moved_vector()Thomas Gleixner1-2/+2
free_moved_vector() accesses the per cpu vector array with this_cpu_write() to clear the vector. The function has two call sites: 1) The vector cleanup IPI 2) The force_complete_move() code path For #1 this_cpu_write() is correct as it runs on the CPU on which the vector needs to be freed. For #2 this_cpu_write() is wrong because the function is called from an outgoing CPU which is not necessarily the CPU on which the previous vector needs to be freed. As a result it sets the vector on the outgoing CPU to NULL, which is pointless as that CPU does not handle interrupts anymore. What's worse is that it leaves the vector on the previous target CPU in place which later on triggers the BUG_ON(vector) in the vector allocation code when the vector gets reused. That's possible because the bitmap allocator entry of that CPU is freed correctly. Always use the CPU to which the vector was associated and clear the vector entry on that CPU. Fixup the tracepoint as well so it tracks on which CPU the vector gets removed. Fixes: 69cde0004a4b ("x86/vector: Use matrix allocator for vector assignment") Reported-by: Petri Latvala <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Cc: Juergen Gross <[email protected]> Cc: Tony Luck <[email protected]> Cc: Len Brown <[email protected]> Cc: Marc Zyngier <[email protected]> Cc: Joerg Roedel <[email protected]> Cc: "Rafael J. Wysocki" <[email protected]> Cc: Steven Rostedt <[email protected]> Cc: Christoph Hellwig <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Rui Zhang <[email protected]> Cc: Borislav Petkov <[email protected]> Cc: Paolo Bonzini <[email protected]> Cc: Boris Ostrovsky <[email protected]> Cc: "K. Y. Srinivasan" <[email protected]> Cc: Arjan van de Ven <[email protected]> Cc: Alok Kataria <[email protected]> Cc: Dan Williams <[email protected]> Cc: Yu Chen <[email protected]> Link: https://lkml.kernel.org/r/alpine.DEB.2.20.1710161614430.1973@nanos
2017-10-12x86/apic/vector: Ignore set_affinity call for inactive interruptsThomas Gleixner1-0/+13
The core interrupt code can call the affinity setter for inactive interrupts under certain circumstances. For inactive intererupts which use managed or reservation mode this is a pointless exercise as the activation will assign a vector which fits the destination mask. Check for this and return w/o going through the vector assignment. Signed-off-by: Thomas Gleixner <[email protected]>
2017-09-25x86/vector: Respect affinity mask in irq descriptorThomas Gleixner1-4/+17
The interrupt descriptor has a preset affinity mask at allocation time, which is usually the default affinity mask. The current code does not respect that mask and places the vector at some random CPU, which gets corrected later by a set_affinity() call. That's silly because the vector allocation can respect the mask upfront and place the interrupt on a CPU which is in the mask. If that fails, then the affinity is broken and a interrupt assigned on any online CPU. Signed-off-by: Thomas Gleixner <[email protected]> Tested-by: Juergen Gross <[email protected]> Tested-by: Yu Chen <[email protected]> Acked-by: Juergen Gross <[email protected]> Cc: Boris Ostrovsky <[email protected]> Cc: Tony Luck <[email protected]> Cc: Marc Zyngier <[email protected]> Cc: Alok Kataria <[email protected]> Cc: Joerg Roedel <[email protected]> Cc: "Rafael J. Wysocki" <[email protected]> Cc: Steven Rostedt <[email protected]> Cc: Christoph Hellwig <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Borislav Petkov <[email protected]> Cc: Paolo Bonzini <[email protected]> Cc: Rui Zhang <[email protected]> Cc: "K. Y. Srinivasan" <[email protected]> Cc: Arjan van de Ven <[email protected]> Cc: Dan Williams <[email protected]> Cc: Len Brown <[email protected]> Link: https://lkml.kernel.org/r/[email protected]
2017-09-25x86/irq: Simplify hotplug vector accountingThomas Gleixner1-1/+31
Before a CPU is taken offline the number of active interrupt vectors on the outgoing CPU and the number of vectors which are available on the other online CPUs are counted and compared. If the active vectors are more than the available vectors on the other CPUs then the CPU hot-unplug operation is aborted. This again uses loop based search and is inaccurate. The bitmap matrix allocator has accurate accounting information and can tell exactly whether the vector space is sufficient or not. Emit a message when the number of globaly reserved (unallocated) vectors is larger than the number of available vectors after offlining a CPU because after that point request_irq() might fail. Signed-off-by: Thomas Gleixner <[email protected]> Tested-by: Juergen Gross <[email protected]> Tested-by: Yu Chen <[email protected]> Acked-by: Juergen Gross <[email protected]> Cc: Boris Ostrovsky <[email protected]> Cc: Tony Luck <[email protected]> Cc: Marc Zyngier <[email protected]> Cc: Alok Kataria <[email protected]> Cc: Joerg Roedel <[email protected]> Cc: "Rafael J. Wysocki" <[email protected]> Cc: Steven Rostedt <[email protected]> Cc: Christoph Hellwig <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Borislav Petkov <[email protected]> Cc: Paolo Bonzini <[email protected]> Cc: Rui Zhang <[email protected]> Cc: "K. Y. Srinivasan" <[email protected]> Cc: Arjan van de Ven <[email protected]> Cc: Dan Williams <[email protected]> Cc: Len Brown <[email protected]> Link: https://lkml.kernel.org/r/[email protected]
2017-09-25x86/vector: Switch IOAPIC to global reservation modeThomas Gleixner1-23/+33
IOAPICs install and allocate vectors for inactive interrupts. This results in problems on CPU offline and wastes vector resources for nothing. Handle inactive IOAPIC interrupts in the same way as inactive MSI interrupts and switch them to the global reservation mode. Signed-off-by: Thomas Gleixner <[email protected]> Tested-by: Juergen Gross <[email protected]> Tested-by: Yu Chen <[email protected]> Acked-by: Juergen Gross <[email protected]> Cc: Boris Ostrovsky <[email protected]> Cc: Tony Luck <[email protected]> Cc: Marc Zyngier <[email protected]> Cc: Alok Kataria <[email protected]> Cc: Joerg Roedel <[email protected]> Cc: "Rafael J. Wysocki" <[email protected]> Cc: Steven Rostedt <[email protected]> Cc: Christoph Hellwig <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Borislav Petkov <[email protected]> Cc: Paolo Bonzini <[email protected]> Cc: Rui Zhang <[email protected]> Cc: "K. Y. Srinivasan" <[email protected]> Cc: Arjan van de Ven <[email protected]> Cc: Dan Williams <[email protected]> Cc: Len Brown <[email protected]> Link: https://lkml.kernel.org/r/[email protected]
2017-09-25x86/vector/msi: Switch to global reservation modeThomas Gleixner1-34/+63
Devices with many queues allocate a huge number of interrupts and get assigned a vector for each of them, even if the queues are not active and the interrupts never requested. This causes problems with the decision whether the global vector space is sufficient for CPU hot unplug operations. Change it to a reservation scheme, which allows overcommitment. When the interrupt is allocated and initialized the vector assignment merily updates the reservation request counter in the matrix allocator. This counter is used to emit warnings when the reservation exceeds the available vector space, but does not affect CPU offline operations. Like the managed interrupts the corresponding MSI/DMAR/IOAPIC entries are directed to the special shutdown vector. When the interrupt is requested, then the activation code tries to assign a real vector. If that succeeds the interrupt is started up and functional. If that fails, then subsequently request_irq() fails with -ENOSPC. This allows a clear separation of inactive and active modes and simplifies the final decisions whether the global vector space is sufficient for CPU offline operations. Signed-off-by: Thomas Gleixner <[email protected]> Tested-by: Juergen Gross <[email protected]> Tested-by: Yu Chen <[email protected]> Acked-by: Juergen Gross <[email protected]> Cc: Boris Ostrovsky <[email protected]> Cc: Tony Luck <[email protected]> Cc: Marc Zyngier <[email protected]> Cc: Alok Kataria <[email protected]> Cc: Joerg Roedel <[email protected]> Cc: "Rafael J. Wysocki" <[email protected]> Cc: Steven Rostedt <[email protected]> Cc: Christoph Hellwig <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Borislav Petkov <[email protected]> Cc: Paolo Bonzini <[email protected]> Cc: Rui Zhang <[email protected]> Cc: "K. Y. Srinivasan" <[email protected]> Cc: Arjan van de Ven <[email protected]> Cc: Dan Williams <[email protected]> Cc: Len Brown <[email protected]> Link: https://lkml.kernel.org/r/[email protected]
2017-09-25x86/vector: Handle managed interrupts properThomas Gleixner1-18/+172
Managed interrupts need to reserve interrupt vectors permanently, but as long as the interrupt is deactivated, the vector should not be active. Reserve a new system vector, which can be used to initially initialize MSI/DMAR/IOAPIC entries. In that situation the interrupts are disabled in the corresponding MSI/DMAR/IOAPIC devices. So the vector should never be sent to any CPU. When the managed interrupt is started up, a real vector is assigned from the managed vector space and configured in MSI/DMAR/IOAPIC. This allows a clear separation of inactive and active modes and simplifies the final decisions whether the global vector space is sufficient for CPU offline operations. The vector space can be reserved even on offline CPUs and will survive CPU offline/online operations. Signed-off-by: Thomas Gleixner <[email protected]> Tested-by: Juergen Gross <[email protected]> Tested-by: Yu Chen <[email protected]> Acked-by: Juergen Gross <[email protected]> Cc: Boris Ostrovsky <[email protected]> Cc: Tony Luck <[email protected]> Cc: Marc Zyngier <[email protected]> Cc: Alok Kataria <[email protected]> Cc: Joerg Roedel <[email protected]> Cc: "Rafael J. Wysocki" <[email protected]> Cc: Steven Rostedt <[email protected]> Cc: Christoph Hellwig <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Borislav Petkov <[email protected]> Cc: Paolo Bonzini <[email protected]> Cc: Rui Zhang <[email protected]> Cc: "K. Y. Srinivasan" <[email protected]> Cc: Arjan van de Ven <[email protected]> Cc: Dan Williams <[email protected]> Cc: Len Brown <[email protected]> Link: https://lkml.kernel.org/r/[email protected]
2017-09-25x86/vector: Untangle internal state from irq_cfgThomas Gleixner1-40/+48
The vector management state is not required to live in irq_cfg. irq_cfg is only relevant for the depending irq domains (IOAPIC, DMAR, MSI ...). The seperation of the vector management status allows to direct a shut down interrupt to a special shutdown vector w/o confusing the internal state of the vector management. Preparatory change for the rework of managed interrupts and the global vector reservation scheme. Signed-off-by: Thomas Gleixner <[email protected]> Tested-by: Juergen Gross <[email protected]> Tested-by: Yu Chen <[email protected]> Acked-by: Juergen Gross <[email protected]> Cc: Boris Ostrovsky <[email protected]> Cc: Tony Luck <[email protected]> Cc: Marc Zyngier <[email protected]> Cc: Alok Kataria <[email protected]> Cc: Joerg Roedel <[email protected]> Cc: "Rafael J. Wysocki" <[email protected]> Cc: Steven Rostedt <[email protected]> Cc: Christoph Hellwig <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Borislav Petkov <[email protected]> Cc: Paolo Bonzini <[email protected]> Cc: Rui Zhang <[email protected]> Cc: "K. Y. Srinivasan" <[email protected]> Cc: Arjan van de Ven <[email protected]> Cc: Dan Williams <[email protected]> Cc: Len Brown <[email protected]> Link: https://lkml.kernel.org/r/[email protected]
2017-09-25x86/vector: Compile SMP only code conditionallyThomas Gleixner1-15/+20
No point in compiling this for UP. Signed-off-by: Thomas Gleixner <[email protected]> Tested-by: Juergen Gross <[email protected]> Tested-by: Yu Chen <[email protected]> Acked-by: Juergen Gross <[email protected]> Cc: Boris Ostrovsky <[email protected]> Cc: Tony Luck <[email protected]> Cc: Marc Zyngier <[email protected]> Cc: Alok Kataria <[email protected]> Cc: Joerg Roedel <[email protected]> Cc: "Rafael J. Wysocki" <[email protected]> Cc: Steven Rostedt <[email protected]> Cc: Christoph Hellwig <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Borislav Petkov <[email protected]> Cc: Paolo Bonzini <[email protected]> Cc: Rui Zhang <[email protected]> Cc: "K. Y. Srinivasan" <[email protected]> Cc: Arjan van de Ven <[email protected]> Cc: Dan Williams <[email protected]> Cc: Len Brown <[email protected]> Link: https://lkml.kernel.org/r/[email protected]
2017-09-25x86/vector: Use matrix allocator for vector assignmentThomas Gleixner1-172/+116
Replace the magic vector allocation code by a simple bitmap matrix allocator. This avoids loops and hoops over CPUs and vector arrays, so in case of densly used vector spaces it's way faster. This also gets rid of the magic 'spread the vectors accross priority levels' heuristics in the current allocator: The comment in __asign_irq_vector says: * NOTE! The local APIC isn't very good at handling * multiple interrupts at the same interrupt level. * As the interrupt level is determined by taking the * vector number and shifting that right by 4, we * want to spread these out a bit so that they don't * all fall in the same interrupt level. After doing some palaeontological research the following was found the following in the PPro Developer Manual Volume 3: "7.4.2. Valid Interrupts The local and I/O APICs support 240 distinct vectors in the range of 16 to 255. Interrupt priority is implied by its vector, according to the following relationship: priority = vector / 16 One is the lowest priority and 15 is the highest. Vectors 16 through 31 are reserved for exclusive use by the processor. The remaining vectors are for general use. The processor's local APIC includes an in-service entry and a holding entry for each priority level. To avoid losing inter- rupts, software should allocate no more than 2 interrupt vectors per priority." The current SDM tells nothing about that, instead it states: "If more than one interrupt is generated with the same vector number, the local APIC can set the bit for the vector both in the IRR and the ISR. This means that for the Pentium 4 and Intel Xeon processors, the IRR and ISR can queue two interrupts for each interrupt vector: one in the IRR and one in the ISR. Any additional interrupts issued for the same interrupt vector are collapsed into the single bit in the IRR. For the P6 family and Pentium processors, the IRR and ISR registers can queue no more than two interrupts per interrupt vector and will reject other interrupts that are received within the same vector." Which means, that on P6/Pentium the APIC will reject a new message and tell the sender to retry, which increases the load on the APIC bus and nothing more. There is no affirmative answer from Intel on that, but it's a sane approach to remove that for the following reasons: 1) No other (relevant Open Source) operating systems bothers to implement this or mentiones this at all. 2) The current allocator has no enforcement for this and especially the legacy interrupts, which are the main source of interrupts on these P6 and older systmes, are allocated linearly in the same priority level and just work. 3) The current machines have no problem with that at all as verified with some experiments. 4) AMD at least confirmed that such an issue is unknown. 5) P6 and older are dinosaurs almost 20 years EOL, so there is really no reason to worry about that too much. Signed-off-by: Thomas Gleixner <[email protected]> Tested-by: Juergen Gross <[email protected]> Tested-by: Yu Chen <[email protected]> Acked-by: Juergen Gross <[email protected]> Cc: Boris Ostrovsky <[email protected]> Cc: Tony Luck <[email protected]> Cc: Marc Zyngier <[email protected]> Cc: Alok Kataria <[email protected]> Cc: Joerg Roedel <[email protected]> Cc: "Rafael J. Wysocki" <[email protected]> Cc: Steven Rostedt <[email protected]> Cc: Christoph Hellwig <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Borislav Petkov <[email protected]> Cc: Paolo Bonzini <[email protected]> Cc: Rui Zhang <[email protected]> Cc: "K. Y. Srinivasan" <[email protected]> Cc: Arjan van de Ven <[email protected]> Cc: Dan Williams <[email protected]> Cc: Len Brown <[email protected]> Link: https://lkml.kernel.org/r/[email protected]
2017-09-25x86/vector: Add tracepoints for vector managementThomas Gleixner1-0/+2
Add tracepoints for analysing the new vector management Signed-off-by: Thomas Gleixner <[email protected]> Tested-by: Juergen Gross <[email protected]> Tested-by: Yu Chen <[email protected]> Acked-by: Juergen Gross <[email protected]> Cc: Boris Ostrovsky <[email protected]> Cc: Tony Luck <[email protected]> Cc: Marc Zyngier <[email protected]> Cc: Alok Kataria <[email protected]> Cc: Joerg Roedel <[email protected]> Cc: "Rafael J. Wysocki" <[email protected]> Cc: Steven Rostedt <[email protected]> Cc: Christoph Hellwig <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Borislav Petkov <[email protected]> Cc: Paolo Bonzini <[email protected]> Cc: Rui Zhang <[email protected]> Cc: "K. Y. Srinivasan" <[email protected]> Cc: Arjan van de Ven <[email protected]> Cc: Dan Williams <[email protected]> Cc: Len Brown <[email protected]> Link: https://lkml.kernel.org/r/[email protected]
2017-09-25x86/vector: Add vector domain debugfs supportThomas Gleixner1-2/+48
Add the debug callback for the vector domain, which gives a detailed information about vector usage if invoked for the domain by using rhe matrix allocator debug function and vector/target information when invoked for a particular interrupt. Extra information foir the Vector domain: Online bitmaps: 32 Global available: 6352 Global reserved: 5 Total allocated: 20 System: 41: 0-19,32,50,128,238-255 | CPU | avl | man | act | vectors 0 183 4 19 33-48,51-53 1 199 4 1 33 2 199 4 0 Extra information for interrupts: Vector: 42 Target: 4 This allows a detailed analysis of the vector usage and the association to interrupts and devices. Signed-off-by: Thomas Gleixner <[email protected]> Tested-by: Juergen Gross <[email protected]> Tested-by: Yu Chen <[email protected]> Acked-by: Juergen Gross <[email protected]> Cc: Boris Ostrovsky <[email protected]> Cc: Tony Luck <[email protected]> Cc: Marc Zyngier <[email protected]> Cc: Alok Kataria <[email protected]> Cc: Joerg Roedel <[email protected]> Cc: "Rafael J. Wysocki" <[email protected]> Cc: Steven Rostedt <[email protected]> Cc: Christoph Hellwig <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Borislav Petkov <[email protected]> Cc: Paolo Bonzini <[email protected]> Cc: Rui Zhang <[email protected]> Cc: "K. Y. Srinivasan" <[email protected]> Cc: Arjan van de Ven <[email protected]> Cc: Dan Williams <[email protected]> Cc: Len Brown <[email protected]> Link: https://lkml.kernel.org/r/[email protected]
2017-09-25x86/irq/vector: Initialize matrix allocatorThomas Gleixner1-4/+52
Initialize the matrix allocator and add the proper accounting points to the code. No functional change, just preparation. Signed-off-by: Thomas Gleixner <[email protected]> Tested-by: Juergen Gross <[email protected]> Tested-by: Yu Chen <[email protected]> Acked-by: Juergen Gross <[email protected]> Cc: Boris Ostrovsky <[email protected]> Cc: Tony Luck <[email protected]> Cc: Marc Zyngier <[email protected]> Cc: Alok Kataria <[email protected]> Cc: Joerg Roedel <[email protected]> Cc: "Rafael J. Wysocki" <[email protected]> Cc: Steven Rostedt <[email protected]> Cc: Christoph Hellwig <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Borislav Petkov <[email protected]> Cc: Paolo Bonzini <[email protected]> Cc: Rui Zhang <[email protected]> Cc: "K. Y. Srinivasan" <[email protected]> Cc: Arjan van de Ven <[email protected]> Cc: Dan Williams <[email protected]> Cc: Len Brown <[email protected]> Link: https://lkml.kernel.org/r/[email protected]
2017-09-25x86/vector: Move helper functions aroundThomas Gleixner1-15/+15
Move the helper functions to a different place as they would end up in the middle of management functions. Signed-off-by: Thomas Gleixner <[email protected]> Tested-by: Juergen Gross <[email protected]> Tested-by: Yu Chen <[email protected]> Acked-by: Juergen Gross <[email protected]> Cc: Boris Ostrovsky <[email protected]> Cc: Tony Luck <[email protected]> Cc: Marc Zyngier <[email protected]> Cc: Alok Kataria <[email protected]> Cc: Joerg Roedel <[email protected]> Cc: "Rafael J. Wysocki" <[email protected]> Cc: Steven Rostedt <[email protected]> Cc: Christoph Hellwig <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Borislav Petkov <[email protected]> Cc: Paolo Bonzini <[email protected]> Cc: Rui Zhang <[email protected]> Cc: "K. Y. Srinivasan" <[email protected]> Cc: Arjan van de Ven <[email protected]> Cc: Dan Williams <[email protected]> Cc: Len Brown <[email protected]> Link: https://lkml.kernel.org/r/[email protected]