aboutsummaryrefslogtreecommitdiff
path: root/include/linux
AgeCommit message (Collapse)AuthorFilesLines
2023-04-21hugetlb: pte_alloc_huge() to replace huge pte_alloc_map()Hugh Dickins1-0/+17
Some architectures can have their hugetlb pages down at the lowest PTE level: their huge_pte_alloc() using pte_alloc_map(), but without any following pte_unmap(). Since none of these arches uses CONFIG_HIGHPTE, this is not seen as a problem at present; but would become a problem if forthcoming changes were to add an rcu_read_lock() into pte_offset_map(), with the rcu_read_unlock() expected in pte_unmap(). Similarly in their huge_pte_offset(): pte_offset_kernel() is good enough for that, but it's probably less confusing if we define pte_offset_huge() along with pte_alloc_huge(). Only define them without CONFIG_HIGHPTE: so there would be a build error to signal if ever more work is needed. For ease of development, define these now for 6.4-rc1, ahead of any use: then architectures can integrate patches using them, independent from mm. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Hugh Dickins <[email protected]> Reviewed-by: Mike Kravetz <[email protected]> Cc: Muchun Song <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-04-21mm: add new KSM process and sysfs knobsStefan Roesch1-0/+5
This adds the general_profit KSM sysfs knob and the process profit metric knobs to ksm_stat. 1) expose general_profit metric The documentation mentions a general profit metric, however this metric is not calculated. In addition the formula depends on the size of internal structures, which makes it more difficult for an administrator to make the calculation. Adding the metric for a better user experience. 2) document general_profit sysfs knob 3) calculate ksm process profit metric The ksm documentation mentions the process profit metric and how to calculate it. This adds the calculation of the metric. 4) mm: expose ksm process profit metric in ksm_stat This exposes the ksm process profit metric in /proc/<pid>/ksm_stat. The documentation mentions the formula for the ksm process profit metric, however it does not calculate it. In addition the formula depends on the size of internal structures. So it makes sense to expose it. 5) document new procfs ksm knobs Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Stefan Roesch <[email protected]> Reviewed-by: Bagas Sanjaya <[email protected]> Acked-by: David Hildenbrand <[email protected]> Cc: David Hildenbrand <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Rik van Riel <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-04-21mm: add new api to enable ksm per processStefan Roesch2-2/+20
Patch series "mm: process/cgroup ksm support", v9. So far KSM can only be enabled by calling madvise for memory regions. To be able to use KSM for more workloads, KSM needs to have the ability to be enabled / disabled at the process / cgroup level. Use case 1: The madvise call is not available in the programming language. An example for this are programs with forked workloads using a garbage collected language without pointers. In such a language madvise cannot be made available. In addition the addresses of objects get moved around as they are garbage collected. KSM sharing needs to be enabled "from the outside" for these type of workloads. Use case 2: The same interpreter can also be used for workloads where KSM brings no benefit or even has overhead. We'd like to be able to enable KSM on a workload by workload basis. Use case 3: With the madvise call sharing opportunities are only enabled for the current process: it is a workload-local decision. A considerable number of sharing opportunities may exist across multiple workloads or jobs (if they are part of the same security domain). Only a higler level entity like a job scheduler or container can know for certain if its running one or more instances of a job. That job scheduler however doesn't have the necessary internal workload knowledge to make targeted madvise calls. Security concerns: In previous discussions security concerns have been brought up. The problem is that an individual workload does not have the knowledge about what else is running on a machine. Therefore it has to be very conservative in what memory areas can be shared or not. However, if the system is dedicated to running multiple jobs within the same security domain, its the job scheduler that has the knowledge that sharing can be safely enabled and is even desirable. Performance: Experiments with using UKSM have shown a capacity increase of around 20%. Here are the metrics from an instagram workload (taken from a machine with 64GB main memory): full_scans: 445 general_profit: 20158298048 max_page_sharing: 256 merge_across_nodes: 1 pages_shared: 129547 pages_sharing: 5119146 pages_to_scan: 4000 pages_unshared: 1760924 pages_volatile: 10761341 run: 1 sleep_millisecs: 20 stable_node_chains: 167 stable_node_chains_prune_millisecs: 2000 stable_node_dups: 2751 use_zero_pages: 0 zero_pages_sharing: 0 After the service is running for 30 minutes to an hour, 4 to 5 million shared pages are common for this workload when using KSM. Detailed changes: 1. New options for prctl system command This patch series adds two new options to the prctl system call. The first one allows to enable KSM at the process level and the second one to query the setting. The setting will be inherited by child processes. With the above setting, KSM can be enabled for the seed process of a cgroup and all processes in the cgroup will inherit the setting. 2. Changes to KSM processing When KSM is enabled at the process level, the KSM code will iterate over all the VMA's and enable KSM for the eligible VMA's. When forking a process that has KSM enabled, the setting will be inherited by the new child process. 3. Add general_profit metric The general_profit metric of KSM is specified in the documentation, but not calculated. This adds the general profit metric to /sys/kernel/debug/mm/ksm. 4. Add more metrics to ksm_stat This adds the process profit metric to /proc/<pid>/ksm_stat. 5. Add more tests to ksm_tests and ksm_functional_tests This adds an option to specify the merge type to the ksm_tests. This allows to test madvise and prctl KSM. It also adds a two new tests to ksm_functional_tests: one to test the new prctl options and the other one is a fork test to verify that the KSM process setting is inherited by client processes. This patch (of 3): So far KSM can only be enabled by calling madvise for memory regions. To be able to use KSM for more workloads, KSM needs to have the ability to be enabled / disabled at the process / cgroup level. 1. New options for prctl system command This patch series adds two new options to the prctl system call. The first one allows to enable KSM at the process level and the second one to query the setting. The setting will be inherited by child processes. With the above setting, KSM can be enabled for the seed process of a cgroup and all processes in the cgroup will inherit the setting. 2. Changes to KSM processing When KSM is enabled at the process level, the KSM code will iterate over all the VMA's and enable KSM for the eligible VMA's. When forking a process that has KSM enabled, the setting will be inherited by the new child process. 1) Introduce new MMF_VM_MERGE_ANY flag This introduces the new flag MMF_VM_MERGE_ANY flag. When this flag is set, kernel samepage merging (ksm) gets enabled for all vma's of a process. 2) Setting VM_MERGEABLE on VMA creation When a VMA is created, if the MMF_VM_MERGE_ANY flag is set, the VM_MERGEABLE flag will be set for this VMA. 3) support disabling of ksm for a process This adds the ability to disable ksm for a process if ksm has been enabled for the process with prctl. 4) add new prctl option to get and set ksm for a process This adds two new options to the prctl system call - enable ksm for all vmas of a process (if the vmas support it). - query if ksm has been enabled for a process. 3. Disabling MMF_VM_MERGE_ANY for storage keys in s390 In the s390 architecture when storage keys are used, the MMF_VM_MERGE_ANY will be disabled. Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Stefan Roesch <[email protected]> Acked-by: David Hildenbrand <[email protected]> Cc: David Hildenbrand <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Rik van Riel <[email protected]> Cc: Bagas Sanjaya <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-04-21mm: correct arg in reclaim_pages()/reclaim_clean_pages_from_list()Kefeng Wang1-1/+0
Both of them change the arg from page_list to folio_list when convert them to use a folio, but not the declaration, let's correct it, also move the reclaim_pages() from swap.h to internal.h as it only used in mm. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Kefeng Wang <[email protected]> Reviwed-by: Matthew Wilcox (Oracle) <[email protected]> Reviewed-by: SeongJae Park <[email protected]> Reviewed-by: David Hildenbrand <[email protected]> Cc: Hugh Dickins <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-04-21fs/buffer: add folio_create_empty_buffers helperPankaj Raghav1-0/+2
Folio version of create_empty_buffers(). This is required to convert create_page_buffers() to folio_create_buffers() later in the series. It removes several calls to compound_head() as it works directly on folio compared to create_empty_buffers(). Hence, create_empty_buffers() has been modified to call folio_create_empty_buffers(). Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Pankaj Raghav <[email protected]> Reviewed-by: Matthew Wilcox (Oracle) <[email protected]> Reviewed-by: Hannes Reinecke <[email protected]> Cc: Alexander Viro <[email protected]> Cc: Christian Brauner <[email protected]> Cc: Luis Chamberlain <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-04-21buffer: add folio_alloc_buffers() helperPankaj Raghav1-0/+2
Folio version of alloc_page_buffers() helper. This is required to convert create_page_buffers() to folio_create_buffers() later in the series. alloc_page_buffers() has been modified to call folio_alloc_buffers() which adds one call to compound_head() but folio_alloc_buffers() removes one call to compound_head() compared to the existing alloc_page_buffers() implementation. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Pankaj Raghav <[email protected]> Reviewed-by: Matthew Wilcox (Oracle) <[email protected]> Reviewed-by: Hannes Reinecke <[email protected]> Cc: Alexander Viro <[email protected]> Cc: Christian Brauner <[email protected]> Cc: Luis Chamberlain <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-04-21fs/buffer: add folio_set_bh helperPankaj Raghav1-0/+2
Patch series "convert create_page_buffers to folio_create_buffers". One of the first kernel panic we hit when we try to increase the block size > 4k is inside create_page_buffers()[1]. Even though buffer.c function do not support large folios (folios > PAGE_SIZE) at the moment, these changes are required when we want to remove that constraint. This patch (of 4): The folio version of set_bh_page(). This is required to convert create_page_buffers() to folio_create_buffers() later in the series. Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Pankaj Raghav <[email protected]> Reviewed-by: Matthew Wilcox (Oracle) <[email protected]> Reviewed-by: Hannes Reinecke <[email protected]> Cc: Alexander Viro <[email protected]> Cc: Christian Brauner <[email protected]> Cc: Luis Chamberlain <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-04-21bpf: add test_run support for netfilter program typeFlorian Westphal1-0/+3
add glue code so a bpf program can be run using userspace-provided netfilter state and packet/skb. Default is to use ipv4:output hook point, but this can be overridden by userspace. Userspace provided netfilter state is restricted, only hook and protocol families can be overridden and only to ipv4/ipv6. Signed-off-by: Florian Westphal <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Alexei Starovoitov <[email protected]>
2023-04-21bpf: minimal support for programs hooked into netfilter frameworkFlorian Westphal1-0/+4
This adds minimal support for BPF_PROG_TYPE_NETFILTER bpf programs that will be invoked via the NF_HOOK() points in the ip stack. Invocation incurs an indirect call. This is not a necessity: Its possible to add 'DEFINE_BPF_DISPATCHER(nf_progs)' and handle the program invocation with the same method already done for xdp progs. This isn't done here to keep the size of this chunk down. Verifier restricts verdicts to either DROP or ACCEPT. Signed-off-by: Florian Westphal <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Alexei Starovoitov <[email protected]>
2023-04-21bpf: add bpf_link support for BPF_NETFILTER programsFlorian Westphal1-0/+1
Add bpf_link support skeleton. To keep this reviewable, no bpf program can be invoked yet, if a program is attached only a c-stub is called and not the actual bpf program. Defaults to 'y' if both netfilter and bpf syscall are enabled in kconfig. Uapi example usage: union bpf_attr attr = { }; attr.link_create.prog_fd = progfd; attr.link_create.attach_type = 0; /* unused */ attr.link_create.netfilter.pf = PF_INET; attr.link_create.netfilter.hooknum = NF_INET_LOCAL_IN; attr.link_create.netfilter.priority = -128; err = bpf(BPF_LINK_CREATE, &attr, sizeof(attr)); ... this would attach progfd to ipv4:input hook. Such hook gets removed automatically if the calling program exits. BPF_NETFILTER program invocation is added in followup change. NF_HOOK_OP_BPF enum will eventually be read from nfnetlink_hook, it allows to tell userspace which program is attached at the given hook when user runs 'nft hook list' command rather than just the priority and not-very-helpful 'this hook runs a bpf prog but I can't tell which one'. Will also be used to disallow registration of two bpf programs with same priority in a followup patch. v4: arm32 cmpxchg only supports 32bit operand s/prio/priority/ v3: restrict prog attachment to ip/ip6 for now, lets lift restrictions if more use cases pop up (arptables, ebtables, netdev ingress/egress etc). Signed-off-by: Florian Westphal <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Alexei Starovoitov <[email protected]>
2023-04-21iomap: Remove IOMAP_DIO_NOSYNC unused dio flagRitesh Harjani (IBM)1-6/+0
IOMAP_DIO_NOSYNC earlier was added for use in btrfs. But it seems for aio dsync writes this is not useful anyway. For aio dsync case, we we queue the request and return -EIOCBQUEUED. Now, since IOMAP_DIO_NOSYNC doesn't let iomap_dio_complete() to call generic_write_sync(), hence we may lose the sync write. Hence kill this flag as it is not in use by any FS now. Tested-by: Disha Goel <[email protected]> Reviewed-by: Christoph Hellwig <[email protected]> Reviewed-by: Darrick J. Wong <[email protected]> Signed-off-by: Ritesh Harjani (IBM) <[email protected]> Signed-off-by: Darrick J. Wong <[email protected]>
2023-04-21fs.h: Add TRACE_IOCB_STRINGS for use in trace pointsRitesh Harjani (IBM)1-0/+14
Add TRACE_IOCB_STRINGS macro which can be used in the trace point patch to print different flag values with meaningful string output. Tested-by: Disha Goel <[email protected]> Reviewed-by: Christoph Hellwig <[email protected]> Signed-off-by: Ritesh Harjani (IBM) <[email protected]> Reviewed-by: Darrick J. Wong <[email protected]> [djwong: line up strings all prettylike] Signed-off-by: Darrick J. Wong <[email protected]>
2023-04-21Merge tag 'irqchip-6.4' of ↵Thomas Gleixner2-6/+18
git://git.kernel.org/pub/scm/linux/kernel/git/maz/arm-platforms into irq/core Pull irqchip changes from Marc Zyngier: - Large RISC-V IPI rework to make way for a new interrupt architecture - More Loongarch fixes from Lianmin Lv, fixing issues in the so called "dual-bridge" systems. - Workaround for the nvidia T241 chip that gets confused in 3 and 4 socket configurations, leading to the GIC malfunctionning in some contexts - Drop support for non-firmware driven GIC configurarations now that the old ARM11MP Cavium board is gone - Workaround for the Rockchip 3588 chip that doesn't correctly deal with the shareability attributes. - Replace uses of of_find_property() with the more appropriate of_property_read_bool() - Make bcm-6345-l1 request its MMIO region - Add suspend support to the SiFive PLIC - Drop support for stih415, stih416 and stid127 platforms Link: https://lore.kernel.org/lkml/[email protected]
2023-04-21Merge tag 'wireless-next-2023-04-21' of ↵Jakub Kicinski2-23/+54
git://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless-next Kalle Valo says: ==================== wireless-next patches for v6.4 Most likely the last -next pull request for v6.4. We have changes all over. rtw88 now supports SDIO bus and iwlwifi continues to work on Wi-Fi 7 support. Not much stack changes this time. Major changes: cfg80211/mac80211 - fix some Fine Time Measurement (FTM) frames not being bufferable - flush frames before key removal to avoid potential unencrypted transmission depending on the hardware design iwlwifi - preparation for Wi-Fi 7 EHT and multi-link support rtw88 - SDIO bus support - RTL8822BS, RTL8822CS and RTL8821CS SDIO chipset support rtw89 - framework firmware backwards compatibility brcmfmac - Cypress 43439 SDIO support mt76 - mt7921 P2P support - mt7996 mesh A-MSDU support - mt7996 EHT support - mt7996 coredump support wcn36xx - support for pronto v3 hardware ath11k - PCIe DeviceTree bindings - WCN6750: enable SAR support ath10k - convert DeviceTree bindings to YAML * tag 'wireless-next-2023-04-21' of git://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless-next: (261 commits) wifi: rtw88: Update spelling in main.h wifi: airo: remove ISA_DMA_API dependency wifi: rtl8xxxu: Simplify setting the initial gain wifi: rtl8xxxu: Add rtl8xxxu_write{8,16,32}_{set,clear} wifi: rtl8xxxu: Don't print the vendor/product/serial wifi: rtw88: Fix memory leak in rtw88_usb wifi: rtw88: call rtw8821c_switch_rf_set() according to chip variant wifi: rtw88: set pkg_type correctly for specific rtw8821c variants wifi: rtw88: rtw8821c: Fix rfe_option field width wifi: rtw88: usb: fix priority queue to endpoint mapping wifi: rtw88: 8822c: add iface combination wifi: rtw88: handle station mode concurrent scan with AP mode wifi: rtw88: prevent scan abort with other VIFs wifi: rtw88: refine reserved page flow for AP mode wifi: rtw88: disallow PS during AP mode wifi: rtw88: 8822c: extend reserved page number wifi: rtw88: add port switch for AP mode wifi: rtw88: add bitmap for dynamic port settings wifi: rtw89: mac: use regular int as return type of DLE buffer request wifi: mac80211: remove return value check of debugfs_create_dir() ... ==================== Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jakub Kicinski <[email protected]>
2023-04-21spi: Add TPM HW flow flagKrishna Yarlagadda1-3/+13
TPM specification [1] defines flow control over SPI. Client device can insert a wait state on MISO when address is transmitted by controller on MOSI. Detecting the wait state in software is only possible for full duplex controllers. For controllers that support only half- duplex, the wait state detection needs to be implemented in hardware. Add a flag SPI_TPM_HW_FLOW for TPM device to set when software flow control is not possible and hardware flow control is expected from SPI controller. Reference: [1] https://trustedcomputinggroup.org/resource/pc-client-platform-tpm -profile-ptp-specification/ Signed-off-by: Krishna Yarlagadda <[email protected]> Reviewed-by: Jerry Snitselaar <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Mark Brown <[email protected]>
2023-04-21posix-cpu-timers: Implement the missing timer_wait_running callbackThomas Gleixner1-6/+11
For some unknown reason the introduction of the timer_wait_running callback missed to fixup posix CPU timers, which went unnoticed for almost four years. Marco reported recently that the WARN_ON() in timer_wait_running() triggers with a posix CPU timer test case. Posix CPU timers have two execution models for expiring timers depending on CONFIG_POSIX_CPU_TIMERS_TASK_WORK: 1) If not enabled, the expiry happens in hard interrupt context so spin waiting on the remote CPU is reasonably time bound. Implement an empty stub function for that case. 2) If enabled, the expiry happens in task work before returning to user space or guest mode. The expired timers are marked as firing and moved from the timer queue to a local list head with sighand lock held. Once the timers are moved, sighand lock is dropped and the expiry happens in fully preemptible context. That means the expiring task can be scheduled out, migrated, interrupted etc. So spin waiting on it is more than suboptimal. The timer wheel has a timer_wait_running() mechanism for RT, which uses a per CPU timer-base expiry lock which is held by the expiry code and the task waiting for the timer function to complete blocks on that lock. This does not work in the same way for posix CPU timers as there is no timer base and expiry for process wide timers can run on any task belonging to that process, but the concept of waiting on an expiry lock can be used too in a slightly different way: - Add a mutex to struct posix_cputimers_work. This struct is per task and used to schedule the expiry task work from the timer interrupt. - Add a task_struct pointer to struct cpu_timer which is used to store a the task which runs the expiry. That's filled in when the task moves the expired timers to the local expiry list. That's not affecting the size of the k_itimer union as there are bigger union members already - Let the task take the expiry mutex around the expiry function - Let the waiter acquire a task reference with rcu_read_lock() held and block on the expiry mutex This avoids spin-waiting on a task which might not even be on a CPU and works nicely for RT too. Fixes: ec8f954a40da ("posix-timers: Use a callback for cancel synchronization on PREEMPT_RT") Reported-by: Marco Elver <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Tested-by: Marco Elver <[email protected]> Tested-by: Sebastian Andrzej Siewior <[email protected]> Reviewed-by: Frederic Weisbecker <[email protected]> Cc: [email protected] Link: https://lore.kernel.org/r/87zg764ojw.ffs@tglx
2023-04-21Merge branch irq/gic-6.4 into irq/irqchip-nextMarc Zyngier2-6/+18
* irq/gic-6.4: : . : Collection of GIC/GICv3 fixes and cleanups : : - Workaround for the nvidia T241 chip that gets confused : in 3 and 4 socket configurations, leading to the GIC : malfunctionning in some contexts : : - Drop support for non-firmware driven GIC configurarations : now that the old ARM11MP Cavium board is gone : : - Workaround for the Rockchip 3588 chip that doesn't : correctly deal with the shareability attributes. : . irqchip/gic-v3: Add Rockchip 3588001 erratum workaround irqchip/gicv3: Workaround for NVIDIA erratum T241-FABRIC-4 irqchip/gic: Drop support for board files Signed-off-by: Marc Zyngier <[email protected]>
2023-04-21sched: Fix performance regression introduced by mm_cidMathieu Desnoyers3-8/+82
Introduce per-mm/cpu current concurrency id (mm_cid) to fix a PostgreSQL sysbench regression reported by Aaron Lu. Keep track of the currently allocated mm_cid for each mm/cpu rather than freeing them immediately on context switch. This eliminates most atomic operations when context switching back and forth between threads belonging to different memory spaces in multi-threaded scenarios (many processes, each with many threads). The per-mm/per-cpu mm_cid values are serialized by their respective runqueue locks. Thread migration is handled by introducing invocation to sched_mm_cid_migrate_to() (with destination runqueue lock held) in activate_task() for migrating tasks. If the destination cpu's mm_cid is unset, and if the source runqueue is not actively using its mm_cid, then the source cpu's mm_cid is moved to the destination cpu on migration. Introduce a task-work executed periodically, similarly to NUMA work, which delays reclaim of cid values when they are unused for a period of time. Keep track of the allocation time for each per-cpu cid, and let the task work clear them when they are observed to be older than SCHED_MM_CID_PERIOD_NS and unused. This task work also clears all mm_cids which are greater or equal to the Hamming weight of the mm cidmask to keep concurrency ids compact. Because we want to ensure the mm_cid converges towards the smaller values as migrations happen, the prior optimization that was done when context switching between threads belonging to the same mm is removed, because it could delay the lazy release of the destination runqueue mm_cid after it has been replaced by a migration. Removing this prior optimization is not an issue performance-wise because the introduced per-mm/per-cpu mm_cid tracking also covers this more specific case. Fixes: af7f588d8f73 ("sched: Introduce per-memory-map concurrency ID") Reported-by: Aaron Lu <[email protected]> Signed-off-by: Mathieu Desnoyers <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Tested-by: Aaron Lu <[email protected]> Link: https://lore.kernel.org/lkml/20230327080502.GA570847@ziqianlu-desk2/
2023-04-21Merge branch 'v6.3-rc7'Peter Zijlstra27-33/+148
Sync with the urgent patches; in particular: a53ce18cacb4 ("sched/fair: Sanitize vruntime of entity being migrated") Signed-off-by: Peter Zijlstra <[email protected]>
2023-04-21pds_core: publish events to the clientsShannon Nelson1-0/+2
When the Core device gets an event from the device, or notices the device FW to be up or down, it needs to send those events on to the clients that have an event handler. Add the code to pass along the events to the clients. The entry points pdsc_register_notify() and pdsc_unregister_notify() are EXPORTed for other drivers that want to listen for these events. Signed-off-by: Shannon Nelson <[email protected]> Acked-by: Jakub Kicinski <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2023-04-21pds_core: add the aux client APIShannon Nelson2-0/+8
Add the client API operations for running adminq commands. The core registers the client with the FW, then the client has a context for requesting adminq services. We expect to add additional operations for other clients, including requesting additional private adminqs and IRQs, but don't have the need yet. Signed-off-by: Shannon Nelson <[email protected]> Acked-by: Jakub Kicinski <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2023-04-21pds_core: add auxiliary_bus devicesShannon Nelson2-0/+15
An auxiliary_bus device is created for each vDPA type VF at VF probe and destroyed at VF remove. The aux device name comes from the driver name + VIF type + the unique id assigned at PCI probe. The VFs are always removed on PF remove, so there should be no issues with VFs trying to access missing PF structures. The auxiliary_device names will look like "pds_core.vDPA.nn" where 'nn' is the VF's uid. Signed-off-by: Shannon Nelson <[email protected]> Acked-by: Jakub Kicinski <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2023-04-21pds_core: set up the VIF definitions and defaultsShannon Nelson1-0/+19
The Virtual Interfaces (VIFs) supported by the DSC's configuration (vDPA, Eth, RDMA, etc) are reported in the dev_ident struct and made visible in debugfs. At this point only vDPA is supported in this driver so we only setup devices for that feature. Signed-off-by: Shannon Nelson <[email protected]> Acked-by: Jakub Kicinski <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2023-04-21pds_core: Add adminq processing and commandsShannon Nelson1-1/+10
Add the service routines for submitting and processing the adminq messages and for handling notifyq events. Signed-off-by: Shannon Nelson <[email protected]> Acked-by: Jakub Kicinski <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2023-04-21pds_core: set up device and adminqShannon Nelson1-0/+638
Set up the basic adminq and notifyq queue structures. These are used mostly by the client drivers for feature configuration. These are essentially the same adminq and notifyq as in the ionic driver. Part of this includes querying for device identity and FW information, so we can make that available to devlink dev info. $ devlink dev info pci/0000:b5:00.0 pci/0000:b5:00.0: driver pds_core serial_number FLM18420073 versions: fixed: asic.id 0x0 asic.rev 0x0 running: fw 1.51.0-73 stored: fw.goldfw 1.15.9-C-22 fw.mainfwa 1.60.0-73 fw.mainfwb 1.60.0-57 Signed-off-by: Shannon Nelson <[email protected]> Acked-by: Jakub Kicinski <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2023-04-21pds_core: add devcmd device interfacesShannon Nelson2-0/+193
The devcmd interface is the basic connection to the device through the PCI BAR for low level identification and command services. This does the early device initialization and finds the identity data, and adds devcmd routines to be used by later driver bits. Signed-off-by: Shannon Nelson <[email protected]> Acked-by: Jakub Kicinski <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2023-04-21pds_core: initial framework for pds_core PF driverShannon Nelson2-0/+585
This is the initial PCI driver framework for the new pds_core device driver and its family of devices. This does the very basics of registering for the new PF PCI device 1dd8:100c, setting up debugfs entries, and registering with devlink. Signed-off-by: Shannon Nelson <[email protected]> Acked-by: Jakub Kicinski <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2023-04-21bridge: Add internal flags for per-{Port, VLAN} neighbor suppressionIdo Schimmel1-0/+1
Add two internal flags that will be used to enable / disable per-{Port, VLAN} neighbor suppression: 1. 'BR_NEIGH_VLAN_SUPPRESS': A per-port flag used to indicate that per-{Port, VLAN} neighbor suppression is enabled on the bridge port. When set, 'BR_NEIGH_SUPPRESS' has no effect. 2. 'BR_VLFLAG_NEIGH_SUPPRESS_ENABLED': A per-VLAN flag used to indicate that neighbor suppression is enabled on the given VLAN. Signed-off-by: Ido Schimmel <[email protected]> Acked-by: Nikolay Aleksandrov <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2023-04-21sctp: delete the nested flexible array payloadXin Long1-1/+1
This patch deletes the flexible-array payload[] from the structure sctp_datahdr to avoid some sparse warnings: # make C=2 CF="-Wflexible-array-nested" M=./net/sctp/ net/sctp/socket.c: note: in included file (through include/net/sctp/structs.h, include/net/sctp/sctp.h): ./include/linux/sctp.h:230:29: warning: nested flexible array This member is not even used anywhere. Signed-off-by: Xin Long <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2023-04-21sctp: delete the nested flexible array hmacXin Long1-1/+1
This patch deletes the flexible-array hmac[] from the structure sctp_authhdr to avoid some sparse warnings: # make C=2 CF="-Wflexible-array-nested" M=./net/sctp/ net/sctp/auth.c: note: in included file (through include/net/sctp/structs.h, include/net/sctp/sctp.h): ./include/linux/sctp.h:735:29: warning: nested flexible array Signed-off-by: Xin Long <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2023-04-21sctp: delete the nested flexible array variableXin Long1-2/+2
This patch deletes the flexible-array variable[] from the structure sctp_sackhdr and sctp_errhdr to avoid some sparse warnings: # make C=2 CF="-Wflexible-array-nested" M=./net/sctp/ net/sctp/sm_statefuns.c: note: in included file (through include/net/sctp/structs.h, include/net/sctp/sctp.h): ./include/linux/sctp.h:451:28: warning: nested flexible array ./include/linux/sctp.h:393:29: warning: nested flexible array Signed-off-by: Xin Long <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2023-04-21sctp: delete the nested flexible array skipXin Long1-2/+2
This patch deletes the flexible-array skip[] from the structure sctp_ifwdtsn/fwdtsn_hdr to avoid some sparse warnings: # make C=2 CF="-Wflexible-array-nested" M=./net/sctp/ net/sctp/stream_interleave.c: note: in included file (through include/net/sctp/structs.h, include/net/sctp/sctp.h): ./include/linux/sctp.h:611:32: warning: nested flexible array ./include/linux/sctp.h:628:33: warning: nested flexible array Signed-off-by: Xin Long <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2023-04-21sctp: delete the nested flexible array paramsXin Long1-3/+3
This patch deletes the flexible-array params[] from the structure sctp_inithdr, sctp_addiphdr and sctp_reconf_chunk to avoid some sparse warnings: # make C=2 CF="-Wflexible-array-nested" M=./net/sctp/ net/sctp/input.c: note: in included file (through include/net/sctp/structs.h, include/net/sctp/sctp.h): ./include/linux/sctp.h:278:29: warning: nested flexible array ./include/linux/sctp.h:675:30: warning: nested flexible array This warning is reported if a structure having a flexible array member is included by other structures. Signed-off-by: Xin Long <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2023-04-21virtio_ring: add a struct device forward declarationShunsuke Mie1-0/+1
The virtio_ring header file uses the struct device without a forward declaration. Signed-off-by: Shunsuke Mie <[email protected]> Message-Id: <[email protected]> Signed-off-by: Michael S. Tsirkin <[email protected]>
2023-04-21virtio-vdpa: add VIRTIO_F_NOTIFICATION_DATA feature supportAlvaro Karsz1-0/+9
Add VIRTIO_F_NOTIFICATION_DATA support for vDPA transport. If this feature is negotiated, the driver passes extra data when kicking a virtqueue. A device that offers this feature needs to implement the kick_vq_with_data callback. kick_vq_with_data receives the vDPA device and data. data includes: 16 bits vqn and 16 bits next available index for split virtqueues. 16 bits vqs, 15 least significant bits of next available index and 1 bit next_wrap for packed virtqueues. This patch follows a patch [1] by Viktor Prutyanov which adds support for the MMIO, channel I/O and modern PCI transports. Signed-off-by: Alvaro Karsz <[email protected]> Message-Id: <[email protected]> Signed-off-by: Michael S. Tsirkin <[email protected]> Acked-by: Jason Wang <[email protected]>
2023-04-21virtio: add VIRTIO_F_NOTIFICATION_DATA feature supportViktor Prutyanov1-0/+2
According to VirtIO spec v1.2, VIRTIO_F_NOTIFICATION_DATA feature indicates that the driver passes extra data along with the queue notifications. In a split queue case, the extra data is 16-bit available index. In a packed queue case, the extra data is 1-bit wrap counter and 15-bit available index. Add support for this feature for MMIO, channel I/O and modern PCI transports. Signed-off-by: Viktor Prutyanov <[email protected]> Acked-by: Jason Wang <[email protected]> Reviewed-by: Xuan Zhuo <[email protected]> Message-Id: <[email protected]> Signed-off-by: Michael S. Tsirkin <[email protected]>
2023-04-21vringh: address kdoc warningsSimon Horman1-2/+15
Address some minor kdoc warnings in vring.h. * Place kdoc for 'struct vringh_config_ops' immediately before the structure * Add missing documentation of members of 'vringh_iov' and 'vringh_kiov' Warnings flagged by: $ ./scripts/kernel-doc -none include/linux/vringh.h include/linux/vringh.h:68: error: Cannot parse struct or union! include/linux/vringh.h:92: warning: Function parameter or member 'iov' not described in 'vringh_iov' include/linux/vringh.h:92: warning: Function parameter or member 'consumed' not described in 'vringh_iov' include/linux/vringh.h:92: warning: Function parameter or member 'i' not described in 'vringh_iov' include/linux/vringh.h:92: warning: Function parameter or member 'used' not described in 'vringh_iov' include/linux/vringh.h:92: warning: Function parameter or member 'max_num' not described in 'vringh_iov' include/linux/vringh.h:104: warning: Function parameter or member 'iov' not described in 'vringh_kiov' include/linux/vringh.h:104: warning: Function parameter or member 'consumed' not described in 'vringh_kiov' include/linux/vringh.h:104: warning: Function parameter or member 'i' not described in 'vringh_kiov' include/linux/vringh.h:104: warning: Function parameter or member 'used' not described in 'vringh_kiov' include/linux/vringh.h:104: warning: Function parameter or member 'max_num' not described in 'vringh_kiov' Signed-off-by: Simon Horman <[email protected]> Message-Id: <[email protected]> Signed-off-by: Michael S. Tsirkin <[email protected]>
2023-04-21vdpa: address kdoc warningsSimon Horman1-3/+11
This patch addresses the following minor kdoc problems. * Incorrect spelling of 'callback' and 'notification' * Unrecognised kdoc format for 'struct vdpa_map_file' * Missing documentation of 'get_vendor_vq_stats' member of 'struct vdpa_config_ops' * Missing documentation of 'max_supported_vqs' and 'supported_features' members of 'struct vdpa_mgmt_dev' Most of these problems were flagged by: $ ./scripts/kernel-doc -Werror -none include/linux/vdpa.h include/linux/vdpa.h:20: warning: expecting prototype for struct vdpa_calllback. Prototype was for struct vdpa_callback instead include/linux/vdpa.h:117: warning: This comment starts with '/**', but isn't a kernel-doc comment. Refer Documentation/doc-guide/kernel-doc.rst * Corresponding file area for device memory mapping include/linux/vdpa.h:357: warning: Function parameter or member 'get_vendor_vq_stats' not described in 'vdpa_config_ops' include/linux/vdpa.h:518: warning: Function parameter or member 'supported_features' not described in 'vdpa_mgmt_dev' include/linux/vdpa.h:518: warning: Function parameter or member 'max_supported_vqs' not described in 'vdpa_mgmt_dev' The misspelling of 'notification' was flagged by: $ ./scripts/checkpatch.pl --codespell --showfile --strict -f include/linux/vdpa.h include/linux/vdpa.h:171: CHECK: 'notifcation' may be misspelled - perhaps 'notification'? ... Signed-off-by: Simon Horman <[email protected]> Message-Id: <[email protected]> Signed-off-by: Michael S. Tsirkin <[email protected]> Acked-by: Jason Wang <[email protected]>
2023-04-21vringh: support VA with iotlbStefano Garzarella1-0/+9
vDPA supports the possibility to use user VA in the iotlb messages. So, let's add support for user VA in vringh to use it in the vDPA simulators. Acked-by: Eugenio Pérez <[email protected]> Signed-off-by: Stefano Garzarella <[email protected]> Message-Id: <[email protected]> Signed-off-by: Michael S. Tsirkin <[email protected]>
2023-04-21vdpa: add bind_mm/unbind_mm callbacksStefano Garzarella1-0/+10
These new optional callbacks is used to bind/unbind the device to a specific address space so the vDPA framework can use VA when these callbacks are implemented. Suggested-by: Jason Wang <[email protected]> Acked-by: Jason Wang <[email protected]> Signed-off-by: Stefano Garzarella <[email protected]> Message-Id: <[email protected]> Signed-off-by: Michael S. Tsirkin <[email protected]>
2023-04-21vdpa: Add eventfd for the vdpa callbackXie Yongji1-0/+6
Add eventfd for the vdpa callback so that user can signal it directly instead of triggering the callback. It will be used for vhost-vdpa case. Signed-off-by: Xie Yongji <[email protected]> Message-Id: <[email protected]> Signed-off-by: Michael S. Tsirkin <[email protected]> Acked-by: Jason Wang <[email protected]>
2023-04-21vdpa: Add set/get_vq_affinity callbacks in vdpa_config_opsXie Yongji1-0/+13
This introduces set/get_vq_affinity callbacks in vdpa_config_ops to support virtqueue affinity management for vdpa device drivers. Signed-off-by: Xie Yongji <[email protected]> Acked-by: Jason Wang <[email protected]> Message-Id: <[email protected]> Signed-off-by: Michael S. Tsirkin <[email protected]>
2023-04-21virtio_ring: Use const to annotate read-only pointer paramsFeng Liu1-7/+7
Add const to make the read-only pointer parameters clear, similar to many existing functions. To implement this change, the commit also introduces the use of `container_of_const` to implement `to_vvq`, which ensures the const-ness of read-only parameters and avoids accidental modification of their members. Signed-off-by: Feng Liu <[email protected]> Reviewed-by: Jiri Pirko <[email protected]> Reviewed-by: Parav Pandit <[email protected]> Reviewed-by: Gavin Li <[email protected]> Reviewed-by: Bodong Wang <[email protected]> Message-Id: <[email protected]> Signed-off-by: Michael S. Tsirkin <[email protected]>
2023-04-21virtio: Reorder fields in 'struct virtqueue'Christophe JAILLET1-1/+1
Group some variables based on their sizes to reduce hole and avoid padding. On x86_64, this shrinks the size of 'struct virtqueue' from 72 to 68 bytes. It saves a few bytes of memory. Signed-off-by: Christophe JAILLET <[email protected]> Message-Id: <8f3d2e49270a2158717e15008e7ed7228196ba02.1676707807.git.christophe.jaillet@wanadoo.fr> Signed-off-by: Michael S. Tsirkin <[email protected]> Acked-by: Peter Lafreniere <[email protected]> Reviewed-by: Stefano Garzarella <[email protected]>
2023-04-20net: move dropreason.h to dropreason-core.hJohannes Berg2-2/+2
This will, after the next patch, hold only the core drop reasons and minimal infrastructure. Fix a small kernel-doc issue while at it, to avoid the move triggering a checker. Signed-off-by: Johannes Berg <[email protected]> Signed-off-by: Jakub Kicinski <[email protected]>
2023-04-20kill the last remaining user of proc_ns_fget()Al Viro1-1/+0
lookups by descriptor are better off closer to syscall surface... Reviewed-by: Christian Brauner <[email protected]> Signed-off-by: Al Viro <[email protected]>
2023-04-20fs: introduce lock_rename_child() helperAl Viro1-0/+1
Pass the dentry of a source file and the dentry of a destination directory to lock parent inodes for rename. As soon as this function returns, ->d_parent of the source file dentry is stable and inodes are properly locked for calling vfs-rename. This helper is needed for ksmbd server. rename request of SMB protocol has to rename an opened file, no matter which directory it's in. Signed-off-by: Al Viro <[email protected]> Signed-off-by: Namjae Jeon <[email protected]> Signed-off-by: Al Viro <[email protected]>
2023-04-20ksmbd: remove internal.h includeNamjae Jeon1-0/+2
Since vfs_path_lookup is exported, It should not be internal. Move vfs_path_lookup prototype in internal.h to linux/namei.h. Suggested-by: Al Viro <[email protected]> Reviewed-by: Christian Brauner <[email protected]> Signed-off-by: Namjae Jeon <[email protected]> Signed-off-by: Al Viro <[email protected]>
2023-04-20net: skbuff: update and rename __kfree_skb_defer()Jakub Kicinski1-1/+1
__kfree_skb_defer() uses the old naming where "defer" meant slab bulk free/alloc APIs. In the meantime we also made __kfree_skb_defer() feed the per-NAPI skb cache, which implies bulk APIs. So take away the 'defer' and add 'napi'. While at it add a drop reason. This only matters on the tx_action path, if the skb has a frag_list. But getting rid of a SKB_DROP_REASON_NOT_SPECIFIED seems like a net benefit so why not. Reviewed-by: Alexander Lobakin <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jakub Kicinski <[email protected]>
2023-04-20net/mlx5: Update op_mode to op_mod for port selectionRoi Dayan1-1/+1
To be consistent with the other enum keys use OP_MOD instead of OP_MODE. Signed-off-by: Roi Dayan <[email protected]> Reviewed-by: Maor Dickman <[email protected]> Signed-off-by: Saeed Mahameed <[email protected]>