path: root/include/linux
Age | Commit message | Author | Files, Lines
2023-08-25 | net: handle ARPHRD_PPP in dev_is_mac_header_xmit() | Nicolas Dichtel | 1 file, -0/+4
The goal is to support a bpf_redirect() from an ethernet device (ingress) to a ppp device (egress). The l2 header is added automatically by the ppp driver, thus the ethernet header should be removed. CC: [email protected] Fixes: 27b29f63058d ("bpf: add bpf_redirect() helper") Signed-off-by: Nicolas Dichtel <[email protected]> Tested-by: Siwar Zitouni <[email protected]> Reviewed-by: Guillaume Nault <[email protected]> Signed-off-by: David S. Miller <[email protected]>
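For context, a hedged sketch of what such a check can look like; only the ARPHRD_PPP addition is confirmed by the commit above, and the rest of the case list in include/linux/if_arp.h is assumed:

  #include <linux/if_arp.h>
  #include <linux/netdevice.h>

  /* Sketch: report whether a device transmits its own L2 (MAC) header.
   * Devices whose driver prepends its own framing, such as PPP, must
   * return false so redirect paths strip the ethernet header first. */
  static inline bool dev_is_mac_header_xmit(const struct net_device *dev)
  {
          switch (dev->type) {
          case ARPHRD_TUNNEL:
          case ARPHRD_TUNNEL6:
          case ARPHRD_SIT:
          case ARPHRD_VOID:
          case ARPHRD_NONE:
          case ARPHRD_RAWIP:
          case ARPHRD_PPP:        /* the addition described above */
                  return false;
          default:
                  return true;
          }
  }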
2023-08-25 | net: pcs: xpcs: support to switch mode for Wangxun NICs | Jiawen Wu | 1 file, -0/+1
According to chapter 6 of the DesignWare Cores Ethernet PCS (version 3.20a) and the custom design manual, add a configuration flow for switching the interface mode. If the interface changes, the following settings are required:

  1. wait for VR_XS_PCS_DIG_STS bit(4, 2) [PSEQ_STATE] = 100b (Power-Good)
  2. write SR_XS_PCS_CTRL2 to select the PCS type
  3. write SR_PMA_CTRL1 and/or SR_XS_PCS_CTRL1 for the link speed
  4. program the PMA registers
  5. write VR_XS_PCS_DIG_CTRL1 bit(15) [VR_RST] = 1b (Vendor-Specific Soft Reset)
  6. wait for VR_XS_PCS_DIG_CTRL1 bit(15) [VR_RST] to be cleared

Only the 10GBASE-R/SGMII/1000BASE-X modes are planned for the current Wangxun devices. There is also a quirk for Wangxun devices that switches the mode even though the interface in the phylink state has not changed, since the PCS falls back to the default 10GBASE-R when the ethernet driver (txgbe) does a LAN reset. Signed-off-by: Jiawen Wu <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2023-08-25 | net: pcs: xpcs: add specific vendor support for Wangxun 10Gb NICs | Jiawen Wu | 1 file, -0/+7
Since Wangxun 10Gb NICs require some special configuration of the Synopsys DesignWare XPCS IP, introduce a dev_flag for different vendors. Read the OUI from the device identifier registers to detect Wangxun devices, and skip xpcs_soft_reset() to avoid resetting the device identifier registers. Signed-off-by: Jiawen Wu <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2023-08-24 | Merge tag 'trace-v6.5-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace | Linus Torvalds | 1 file, -0/+11
Pull tracing fixes from Steven Rostedt:

 - Fix the ring buffer being permanently disabled due to a missed record_disabled(). Changing the trace cpu mask will disable the ring buffers for the CPUs no longer in the mask, but it fails to update the snapshot buffer. If a snapshot takes place, the accounting for the ring buffer being disabled is corrupted, and this can lead to the ring buffer being permanently disabled.

 - Add a test case for snapshot and cpu mask working together.

 - Fix a memleak caused by the function graph tracer not being closed properly. The iterator is used to read the ring buffer. When it opens, it calls the open function of a tracer, and when it is closed, it calls the tracer's close function. While a trace is being read, it is still possible to change the tracer. If this happens between the function graph tracer and the wakeup tracer (which uses function graph tracing), the tracers are not closed properly when the iterator sees the switch, and the wakeup tracer did not initialize its private pointer to NULL, which is used to know whether the function graph tracer was the last tracer. It could be fooled into thinking it was, but then on exit it does not call the close function of the function graph tracer to clean up its data.

 - Fix synthetic events on big endian machines by introducing a union that does the conversions properly.

 - Fix synthetic events printing the number of elements in the stacktrace when they shouldn't.

 - Fix synthetic events stacktrace to not print a bogus value at the end.

 - Introduce a pipe_cpumask that prevents the trace_pipe files from being opened by more than one task (file descriptor). There was a race found where, if splice is called, the iter->ent could become stale and events could be missed. There's no point reading a producer/consumer file by more than one task, as they will corrupt each other anyway. Add a cpumask that keeps track of the per_cpu trace_pipe files as well as the global trace_pipe file, and prevents more than one open of a trace_pipe file that represents the same ring buffer. This prevents the race from happening.

 - Fix ftrace samples for arm64 to work with older compilers.

* tag 'trace-v6.5-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
  samples: ftrace: Replace bti assembly with hint for older compiler
  tracing: Introduce pipe_cpumask to avoid race on trace_pipes
  tracing: Fix memleak due to race between current_tracer and trace
  tracing/synthetic: Allocate one additional element for size
  tracing/synthetic: Skip first entry for stack traces
  tracing/synthetic: Use union instead of casts
  selftests/ftrace: Add a basic testcase for snapshot
  tracing: Fix cpu buffers unavailable due to 'record_disabled' missed
2023-08-24 | scsi: core: raid_class: Remove raid_component_add() | Zhu Wang | 1 file, -4/+0
The raid_component_add() function was added to the kernel tree via the patch "[SCSI] embryonic RAID class" (2005). Remove this function since it has never had any callers in the Linux kernel. raid_component_release() is only used by raid_component_add(), so remove it as well. Signed-off-by: Zhu Wang <[email protected]> Link: https://lore.kernel.org/r/[email protected] Reviewed-by: Bart Van Assche <[email protected]> Fixes: 04b5b5cb0136 ("scsi: core: Fix possible memory leak if device_add() fails") Signed-off-by: Martin K. Petersen <[email protected]>
2023-08-24 | document while_each_thread(), change first_tid() to use for_each_thread() | Oleg Nesterov | 1 file, -0/+4
Add the comment to explain that while_each_thread(g,t) is not rcu-safe unless g is stable (e.g. current). Even if g is a group leader and thus can't exit before t, t or another sub-thread can exec and remove g from the thread_group list. The only lockless user of while_each_thread() is first_tid() and it is fine in that it can't loop forever, yet for_each_thread() looks better and I am going to change while_each_thread/next_thread. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Oleg Nesterov <[email protected]> Cc: Eric W. Biederman <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-08-24 | crash: hotplug support for kexec_load() | Eric DeVolder | 1 file, -2/+12
The hotplug support for kexec_load() requires changes to the userspace kexec-tools and a little extra help from the kernel. Given a kdump capture kernel loaded via kexec_load(), and a subsequent hotplug event, the crash hotplug handler finds the elfcorehdr and rewrites it to reflect the hotplug change. That is the desired outcome; however, at kernel panic time, the purgatory integrity check fails (because the elfcorehdr changed), the capture kernel does not boot, and no vmcore is generated. Therefore, the userspace kexec-tools/kexec must indicate to the kernel that the elfcorehdr can be modified (because the kexec excluded the elfcorehdr from the digest and sized the elfcorehdr memory buffer appropriately). To facilitate hotplug support with kexec_load():

 - a new kexec flag KEXEC_UPDATE_ELFCOREHDR indicates that it is safe for the kernel to modify the kexec_load()'d elfcorehdr
 - the /sys/kernel/crash_elfcorehdr_size node communicates the preferred size of the elfcorehdr memory buffer
 - the sysfs crash_hotplug nodes (ie. /sys/devices/system/[cpu|memory]/crash_hotplug) dynamically take into account kexec_file_load() vs kexec_load() and KEXEC_UPDATE_ELFCOREHDR. This is critical so that the udev rule processing of crash_hotplug is all that is needed to determine if the userspace unload-then-load of the kdump image is to be skipped, or not.

The proposed udev rule change looks like:

  # The kernel updates the crash elfcorehdr for CPU and memory changes
  SUBSYSTEM=="cpu", ATTRS{crash_hotplug}=="1", GOTO="kdump_reload_end"
  SUBSYSTEM=="memory", ATTRS{crash_hotplug}=="1", GOTO="kdump_reload_end"

The table below indicates the behavior of kexec_load()'d kdump image updates (with the new udev crash_hotplug rule in place):

  Kernel |Kexec
  -------+-----+----
   Old   |Old  |New
         | a   | a
  -------+-----+----
   New   | a   | b
  -------+-----+----

where kexec 'old' and 'new' delineate whether kexec-tools has the needed modifications for the crash hotplug feature, and kernel 'old' and 'new' delineate whether the kernel supports this crash hotplug feature.

Behavior 'a' indicates the unload-then-reload of the entire kdump image. For the kexec 'old' column, the unload-then-reload occurs due to the missing flag KEXEC_UPDATE_ELFCOREHDR. An 'old' kernel (with 'new' kexec) does not present the crash_hotplug sysfs node, which leads to the unload-then-reload of the kdump image. Behavior 'b' indicates the desired optimized behavior of the kernel directly modifying the elfcorehdr and avoiding the unload-then-reload of the kdump image.

If the udev rule is not updated with the crash_hotplug node check, then no matter which combination of kernel and kexec is new or old, the kdump image continues to be unload-then-reloaded on hotplug changes.

To fully support the crash hotplug feature, there needs to be a rollout of kernel, kexec-tools and udev rule changes. However, the order of the rollout of these pieces does not matter; kexec_load()'d kdump images still function for hotplug as-is. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Eric DeVolder <[email protected]> Suggested-by: Hari Bathini <[email protected]> Acked-by: Hari Bathini <[email protected]> Acked-by: Baoquan He <[email protected]> Cc: Akhil Raj <[email protected]> Cc: Bjorn Helgaas <[email protected]> Cc: Borislav Petkov (AMD) <[email protected]> Cc: Boris Ostrovsky <[email protected]> Cc: Dave Hansen <[email protected]> Cc: Dave Young <[email protected]> Cc: David Hildenbrand <[email protected]> Cc: Eric W. Biederman <[email protected]> Cc: Greg Kroah-Hartman <[email protected]> Cc: "H. Peter Anvin" <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Jonathan Corbet <[email protected]> Cc: Konrad Rzeszutek Wilk <[email protected]> Cc: Mimi Zohar <[email protected]> Cc: Naveen N. Rao <[email protected]> Cc: Oscar Salvador <[email protected]> Cc: "Rafael J. Wysocki" <[email protected]> Cc: Sean Christopherson <[email protected]> Cc: Sourabh Jain <[email protected]> Cc: Takashi Iwai <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Thomas Weißschuh <[email protected]> Cc: Valentin Schneider <[email protected]> Cc: Vivek Goyal <[email protected]> Cc: Vlastimil Babka <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
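As a hedged illustration of the userspace side, a kexec_load() caller could pass the new flag roughly as below. The flag value and the segment setup are assumptions, not taken from the patch:

  /* Userspace sketch (not from the patch): opt in to in-kernel
   * elfcorehdr updates when loading a kdump kernel via kexec_load(). */
  #include <linux/kexec.h>
  #include <sys/syscall.h>
  #include <unistd.h>

  #ifndef KEXEC_UPDATE_ELFCOREHDR
  #define KEXEC_UPDATE_ELFCOREHDR 0x00000004   /* assumed value */
  #endif

  static long load_kdump_kernel(unsigned long entry, unsigned long nr_segments,
                                struct kexec_segment *segments)
  {
          /* KEXEC_ON_CRASH marks this as the crash kernel; the new flag
           * tells the kernel it may rewrite the elfcorehdr on hotplug. */
          return syscall(SYS_kexec_load, entry, nr_segments, segments,
                         KEXEC_ON_CRASH | KEXEC_UPDATE_ELFCOREHDR);
  }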
2023-08-24 | crash: memory and CPU hotplug sysfs attributes | Eric DeVolder | 1 file, -0/+8
Introduce the crash_hotplug attribute for memory and CPUs for use by userspace. These attributes directly facilitate the udev rule for managing userspace re-loading of the crash kernel upon hot un/plug changes.

For memory, expose the crash_hotplug attribute in the /sys/devices/system/memory directory. For example:

  # udevadm info --attribute-walk /sys/devices/system/memory/memory81
    looking at device '/devices/system/memory/memory81':
      KERNEL=="memory81"
      SUBSYSTEM=="memory"
      DRIVER==""
      ATTR{online}=="1"
      ATTR{phys_device}=="0"
      ATTR{phys_index}=="00000051"
      ATTR{removable}=="1"
      ATTR{state}=="online"
      ATTR{valid_zones}=="Movable"
    looking at parent device '/devices/system/memory':
      KERNELS=="memory"
      SUBSYSTEMS==""
      DRIVERS==""
      ATTRS{auto_online_blocks}=="offline"
      ATTRS{block_size_bytes}=="8000000"
      ATTRS{crash_hotplug}=="1"

For CPUs, expose the crash_hotplug attribute in the /sys/devices/system/cpu directory. For example:

  # udevadm info --attribute-walk /sys/devices/system/cpu/cpu0
    looking at device '/devices/system/cpu/cpu0':
      KERNEL=="cpu0"
      SUBSYSTEM=="cpu"
      DRIVER=="processor"
      ATTR{crash_notes}=="277c38600"
      ATTR{crash_notes_size}=="368"
      ATTR{online}=="1"
    looking at parent device '/devices/system/cpu':
      KERNELS=="cpu"
      SUBSYSTEMS==""
      DRIVERS==""
      ATTRS{crash_hotplug}=="1"
      ATTRS{isolated}==""
      ATTRS{kernel_max}=="8191"
      ATTRS{nohz_full}==" (null)"
      ATTRS{offline}=="4-7"
      ATTRS{online}=="0-3"
      ATTRS{possible}=="0-7"
      ATTRS{present}=="0-3"

With these sysfs attributes in place, it is possible to efficiently instruct the udev rule to skip crash kernel reloading for kernels configured with crash hotplug support. For example, the following is the proposed udev rule change for the RHEL system 98-kexec.rules (as the first lines of the rule file):

  # The kernel updates the crash elfcorehdr for CPU and memory changes
  SUBSYSTEM=="cpu", ATTRS{crash_hotplug}=="1", GOTO="kdump_reload_end"
  SUBSYSTEM=="memory", ATTRS{crash_hotplug}=="1", GOTO="kdump_reload_end"

When examined in the context of 98-kexec.rules, the above rules test if crash_hotplug is set, and if so, the userspace-initiated unload-then-reload of the crash kernel is skipped.

CPU and memory checks are separated in accordance with the CONFIG_HOTPLUG_CPU and CONFIG_MEMORY_HOTPLUG kernel config options. If an architecture supports, for example, memory hotplug but not CPU hotplug, then the /sys/devices/system/memory/crash_hotplug attribute file is present, but the /sys/devices/system/cpu/crash_hotplug attribute file will NOT be present. Thus the udev rule skips userspace processing of memory hot un/plug events, but the udev rule will evaluate false for CPU events, thus allowing userspace to process CPU hot un/plug events (ie the unload-then-reload of the kdump capture kernel). Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Eric DeVolder <[email protected]> Reviewed-by: Sourabh Jain <[email protected]> Acked-by: Hari Bathini <[email protected]> Acked-by: Baoquan He <[email protected]> Cc: Akhil Raj <[email protected]> Cc: Bjorn Helgaas <[email protected]> Cc: Borislav Petkov (AMD) <[email protected]> Cc: Boris Ostrovsky <[email protected]> Cc: Dave Hansen <[email protected]> Cc: Dave Young <[email protected]> Cc: David Hildenbrand <[email protected]> Cc: Eric W. Biederman <[email protected]> Cc: Greg Kroah-Hartman <[email protected]> Cc: "H. Peter Anvin" <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Jonathan Corbet <[email protected]> Cc: Konrad Rzeszutek Wilk <[email protected]> Cc: Mimi Zohar <[email protected]> Cc: Naveen N. Rao <[email protected]> Cc: Oscar Salvador <[email protected]> Cc: "Rafael J. Wysocki" <[email protected]> Cc: Sean Christopherson <[email protected]> Cc: Takashi Iwai <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Thomas Weißschuh <[email protected]> Cc: Valentin Schneider <[email protected]> Cc: Vivek Goyal <[email protected]> Cc: Vlastimil Babka <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-08-24 | crash: add generic infrastructure for crash hotplug support | Eric DeVolder | 2 files, -0/+18
To support crash hotplug, a mechanism is needed to update the crash elfcorehdr upon CPU or memory changes (eg. hot un/plug or off/ onlining). The crash elfcorehdr describes the CPUs and memory to be written into the vmcore. To track CPU changes, callbacks are registered with the cpuhp mechanism via cpuhp_setup_state_nocalls(CPUHP_BP_PREPARE_DYN). The crash hotplug elfcorehdr update has no explicit ordering requirement (relative to other cpuhp states), so meets the criteria for utilizing CPUHP_BP_PREPARE_DYN. CPUHP_BP_PREPARE_DYN is a dynamic state and avoids the need to introduce a new state for crash hotplug. Also, CPUHP_BP_PREPARE_DYN is the last state in the PREPARE group, just prior to the STARTING group, which is very close to the CPU starting up in a plug/online situation, or stopping in a unplug/ offline situation. This minimizes the window of time during an actual plug/online or unplug/offline situation in which the elfcorehdr would be inaccurate. Note that for a CPU being unplugged or offlined, the CPU will still be present in the list of CPUs generated by crash_prepare_elf64_headers(). However, there is no need to explicitly omit the CPU, see justification in 'crash: change crash_prepare_elf64_headers() to for_each_possible_cpu()'. To track memory changes, a notifier is registered to capture the memblock MEM_ONLINE and MEM_OFFLINE events via register_memory_notifier(). The CPU callbacks and memory notifiers invoke crash_handle_hotplug_event() which performs needed tasks and then dispatches the event to the architecture specific arch_crash_handle_hotplug_event() to update the elfcorehdr with the current state of CPUs and memory. During the process, the kexec_lock is held. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Eric DeVolder <[email protected]> Reviewed-by: Sourabh Jain <[email protected]> Acked-by: Hari Bathini <[email protected]> Acked-by: Baoquan He <[email protected]> Cc: Akhil Raj <[email protected]> Cc: Bjorn Helgaas <[email protected]> Cc: Borislav Petkov (AMD) <[email protected]> Cc: Boris Ostrovsky <[email protected]> Cc: Dave Hansen <[email protected]> Cc: Dave Young <[email protected]> Cc: David Hildenbrand <[email protected]> Cc: Eric W. Biederman <[email protected]> Cc: Greg Kroah-Hartman <[email protected]> Cc: "H. Peter Anvin" <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Jonathan Corbet <[email protected]> Cc: Konrad Rzeszutek Wilk <[email protected]> Cc: Mimi Zohar <[email protected]> Cc: Naveen N. Rao <[email protected]> Cc: Oscar Salvador <[email protected]> Cc: "Rafael J. Wysocki" <[email protected]> Cc: Sean Christopherson <[email protected]> Cc: Takashi Iwai <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Thomas Weißschuh <[email protected]> Cc: Valentin Schneider <[email protected]> Cc: Vivek Goyal <[email protected]> Cc: Vlastimil Babka <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
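A minimal sketch of the registration flow described above. cpuhp_setup_state_nocalls() and register_memory_notifier() are the real interfaces named in the commit; the callback names and the no-argument call into crash_handle_hotplug_event() are illustrative only (the real handler takes an event and context):

  #include <linux/cpuhotplug.h>
  #include <linux/init.h>
  #include <linux/memory.h>
  #include <linux/notifier.h>

  static int crash_cpu_callback(unsigned int cpu)
  {
          crash_handle_hotplug_event();   /* illustrative call */
          return 0;
  }

  static int crash_mem_callback(struct notifier_block *nb,
                                unsigned long action, void *data)
  {
          if (action == MEM_ONLINE || action == MEM_OFFLINE)
                  crash_handle_hotplug_event();   /* illustrative call */
          return NOTIFY_OK;
  }

  static struct notifier_block crash_mem_nb = {
          .notifier_call = crash_mem_callback,
  };

  static int __init crash_hotplug_init(void)
  {
          int ret;

          /* Dynamic PREPARE-stage state: no explicit ordering requirement. */
          ret = cpuhp_setup_state_nocalls(CPUHP_BP_PREPARE_DYN, "crash/cpuhp",
                                          crash_cpu_callback, crash_cpu_callback);
          if (ret < 0)
                  return ret;
          return register_memory_notifier(&crash_mem_nb);
  }
  subsys_initcall(crash_hotplug_init);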
2023-08-24 | crash: move a few code bits to setup support of crash hotplug | Eric DeVolder | 2 files, -15/+20
Patch series "crash: Kernel handling of CPU and memory hot un/plug", v28.

Once the kdump service is loaded, if changes to CPUs or memory occur, either by hot un/plug or off/onlining, the crash elfcorehdr must also be updated. The elfcorehdr describes to kdump the CPUs and memory in the system, and any inaccuracies can result in a vmcore with missing CPU context or memory regions.

The current solution utilizes udev to initiate an unload-then-reload of the kdump image (eg. kernel, initrd, boot_params, purgatory and elfcorehdr) by the userspace kexec utility. In the original post I outlined the significant performance problems related to offloading this activity to userspace.

This patchset introduces a generic crash handler that registers with the CPU and memory notifiers. Upon CPU or memory changes, from either hot un/plug or off/onlining, this generic handler is invoked and performs important housekeeping, for example obtaining the appropriate lock, and then invokes an architecture-specific handler to do the appropriate elfcorehdr update. Note the descriptions in the patches 'crash: change crash_prepare_elf64_headers() to for_each_possible_cpu()' and 'x86/crash: optimize CPU changes', which enable further optimizations related to CPU plug/unplug/online/offline performance of elfcorehdr updates.

In the case of x86_64, the arch-specific handler generates a new elfcorehdr and overwrites the old one in memory; thus no involvement with userspace is needed.

To realize the benefits of, and test, this patchset, one must make a couple of minor changes to userspace:

 - Prevent udev from updating the kdump crash kernel on hot un/plug changes. Add the following as the first lines of the RHEL udev rule file /usr/lib/udev/rules.d/98-kexec.rules:

     # The kernel updates the crash elfcorehdr for CPU and memory changes
     SUBSYSTEM=="cpu", ATTRS{crash_hotplug}=="1", GOTO="kdump_reload_end"
     SUBSYSTEM=="memory", ATTRS{crash_hotplug}=="1", GOTO="kdump_reload_end"

   With this changeset applied, the two rules evaluate to false for CPU and memory change events and thus skip the userspace unload-then-reload of kdump.

 - Change to kexec_file_load for loading the kdump kernel. Eg. on RHEL, in /usr/bin/kdumpctl, change to:

     standard_kexec_args="-p -d -s"

   which adds the -s to select the kexec_file_load() syscall.

This kernel patchset also supports kexec_load() with a modified kexec userspace utility. A working changeset to the kexec userspace utility is posted to the kexec-tools mailing list here: http://lists.infradead.org/pipermail/kexec/2023-May/027049.html

To use the kexec-tools patch: apply, build and install kexec-tools, then change kdumpctl's standard_kexec_args to replace the -s with --hotplug. The removal of -s reverts to the kexec_load syscall, and the addition of --hotplug invokes the changes put forth in the kexec-tools patch.

This patch (of 8):

The crash hotplug support leans on the work for the kexec_file_load() syscall. To also support the kexec_load() syscall, a few bits of code need to be moved outside of CONFIG_KEXEC_FILE. As such, these bits are moved out of kexec_file.c and into the common location crash_core.c. In addition, struct crash_mem and crash_notes were moved to new locales so that PROC_KCORE, which sets CRASH_CORE alone, builds correctly. No functionality change intended. Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Eric DeVolder <[email protected]> Reviewed-by: Sourabh Jain <[email protected]> Acked-by: Hari Bathini <[email protected]> Acked-by: Baoquan He <[email protected]> Cc: Akhil Raj <[email protected]> Cc: Bjorn Helgaas <[email protected]> Cc: Borislav Petkov (AMD) <[email protected]> Cc: Boris Ostrovsky <[email protected]> Cc: Dave Hansen <[email protected]> Cc: Dave Young <[email protected]> Cc: David Hildenbrand <[email protected]> Cc: Eric W. Biederman <[email protected]> Cc: Greg Kroah-Hartman <[email protected]> Cc: "H. Peter Anvin" <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Jonathan Corbet <[email protected]> Cc: Konrad Rzeszutek Wilk <[email protected]> Cc: Mimi Zohar <[email protected]> Cc: Naveen N. Rao <[email protected]> Cc: Oscar Salvador <[email protected]> Cc: "Rafael J. Wysocki" <[email protected]> Cc: Sean Christopherson <[email protected]> Cc: Takashi Iwai <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Thomas Weißschuh <[email protected]> Cc: Valentin Schneider <[email protected]> Cc: Vivek Goyal <[email protected]> Cc: Vlastimil Babka <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-08-24 | maple_tree: shrink struct maple_tree | Mateusz Guzik | 1 file, -1/+1
Pack the members of struct maple_tree to avoid holes on 64-bit. The size shrinks from 24 to 16 bytes which will save eight bytes in every structure which embeds it. [[email protected]: changelog alterations] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Mateusz Guzik <[email protected]> Reviewed-by: Liam R. Howlett <[email protected]> Reviewed-by: Matthew Wilcox (Oracle) <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
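The saving comes from ordinary C struct padding rules. The toy example below (not the actual maple_tree layout) shows how reordering removes a hole and the tail padding on an LP64 target, matching the 24-to-16-byte shrink described above:

  #include <stdio.h>

  struct holey {                  /* 24 bytes on x86-64 */
          unsigned int flags;     /* offset 0, then 4 bytes of padding */
          void *root;             /* offset 8 */
          unsigned int depth;     /* offset 16, then 4 bytes of tail padding */
  };

  struct shrunk {                 /* 16 bytes: the two ints share one slot */
          void *root;             /* offset 0 */
          unsigned int flags;     /* offset 8 */
          unsigned int depth;     /* offset 12, no padding needed */
  };

  int main(void)
  {
          printf("%zu %zu\n", sizeof(struct holey), sizeof(struct shrunk));
          return 0;               /* prints: 24 16 */
  }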
2023-08-24 | secretmem: convert page_is_secretmem() to folio_is_secretmem() | Matthew Wilcox (Oracle) | 1 file, -8/+7
The only caller already has a folio, so use it to save calling compound_head() in PageLRU() and remove a use of page->mapping. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Matthew Wilcox (Oracle) <[email protected]> Reviewed-by: Mike Rapoport (IBM) <[email protected]> Reviewed-by: David Hildenbrand <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-08-24 | mm: remove enum page_entry_size | Matthew Wilcox (Oracle) | 2 files, -11/+3
Remove the unnecessary encoding of page order into an enum and pass the page order directly. That lets us get rid of pe_order(). The switch constructs have to be changed to if/else constructs to prevent GCC from warning on builds with 3-level page tables where PMD_ORDER and PUD_ORDER have the same value. If you are looking at this commit because your driver stopped compiling, look at the previous commit as well and audit your driver to be sure it doesn't depend on mmap_lock being held in its ->huge_fault method. [[email protected]: use "order %u" to match the (non dev_t) style] Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Matthew Wilcox (Oracle) <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-08-24 | mm: move PMD_ORDER to pgtable.h | Matthew Wilcox (Oracle) | 1 file, -0/+3
Patch series "Change calling convention for ->huge_fault", v2. There are two unrelated changes to the calling convention for ->huge_fault. I've bundled them together to help people notice the change. The first is to improve scalability of DAX page faults by allowing them to be handled under the VMA lock. The second is to remove enum page_entry_size since it's really unnecessary. The changelogs and documentation updates hopefully work to that end. This patch (of 3): Allow this to be used in generic code. Also add PUD_ORDER. Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Matthew Wilcox (Oracle) <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-08-24 | mm: remove checks for pte_index | Matthew Wilcox (Oracle) | 1 file, -1/+0
Since pte_index is always defined, we don't need to check whether it's defined or not. Delete the slow version that doesn't depend on it and remove the #define since nobody needs to test for it. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Matthew Wilcox (Oracle) <[email protected]> Reviewed-by: Mike Rapoport (IBM) <[email protected]> Cc: Christian Dietrich <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-08-24 | mm/swap: inline folio_set_swap_entry() and folio_swap_entry() | David Hildenbrand | 1 file, -11/+1
Let's simply work on the folio directly and remove the helpers. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: David Hildenbrand <[email protected]> Suggested-by: Matthew Wilcox <[email protected]> Reviewed-by: Chris Li <[email protected]> Cc: Catalin Marinas <[email protected]> Cc: Dan Streetman <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Peter Xu <[email protected]> Cc: Seth Jennings <[email protected]> Cc: Vitaly Wool <[email protected]> Cc: Will Deacon <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-08-24 | mm/swap: use dedicated entry for swap in folio | Matthew Wilcox | 2 files, -13/+15
Let's stop working on the private field and use an explicit swap field. We have to move the swp_entry_t typedef. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Matthew Wilcox <[email protected]> Signed-off-by: David Hildenbrand <[email protected]> Reviewed-by: Chris Li <[email protected]> Cc: Catalin Marinas <[email protected]> Cc: Dan Streetman <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Peter Xu <[email protected]> Cc: Seth Jennings <[email protected]> Cc: Vitaly Wool <[email protected]> Cc: Will Deacon <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-08-24 | mm/swap: stop using page->private on tail pages for THP_SWAP | David Hildenbrand | 2 files, -11/+10
Patch series "mm/swap: stop using page->private on tail pages for THP_SWAP + cleanups". This series stops using page->private on tail pages for THP_SWAP, replaces folio->private by folio->swap for swapcache folios, and starts using "new_folio" for tail pages that we are splitting to remove the usage of page->private for swapcache handling completely. This patch (of 4): Let's stop using page->private on tail pages, making it possible to just unconditionally reuse that field in the tail pages of large folios. The remaining usage of the private field for THP_SWAP is in the THP splitting code (mm/huge_memory.c), that we'll handle separately later. Update the THP_SWAP documentation and sanity checks in mm_types.h and __split_huge_page_tail(). [[email protected]: stop using page->private on tail pages for THP_SWAP] Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: David Hildenbrand <[email protected]> Acked-by: Catalin Marinas <[email protected]> [arm64] Reviewed-by: Yosry Ahmed <[email protected]> Cc: Dan Streetman <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Matthew Wilcox (Oracle) <[email protected]> Cc: Peter Xu <[email protected]> Cc: Seth Jennings <[email protected]> Cc: Vitaly Wool <[email protected]> Cc: Will Deacon <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-08-24 | mm: convert do_set_pte() to set_pte_range() | Yin Fengwei | 1 file, -1/+2
set_pte_range() allows setting up page table entries for a specific range. It takes advantage of the batched rmap update for large folios. It now takes care of calling update_mmu_cache_range(). Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Yin Fengwei <[email protected]> Signed-off-by: Matthew Wilcox (Oracle) <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
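For reference, the shape of the new helper, as a hedged sketch; the exact parameter list is inferred from the description above, not quoted from the patch:

  /* Sketch: map nr consecutive pages of a folio starting at addr.
   * Batches the rmap update and calls update_mmu_cache_range() itself. */
  void set_pte_range(struct vm_fault *vmf, struct folio *folio,
                     struct page *page, unsigned int nr, unsigned long addr);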
2023-08-24 | rmap: add folio_add_file_rmap_range() | Yin Fengwei | 1 file, -0/+2
folio_add_file_rmap_range() allows adding a PTE mapping for a specific range of a file folio. Compared to page_add_file_rmap(), it batches the __lruvec_stat updates for large folios. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Yin Fengwei <[email protected]> Signed-off-by: Matthew Wilcox (Oracle) <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-08-24 | mm: tidy up set_ptes definition | Matthew Wilcox (Oracle) | 1 file, -6/+0
Now that all architectures are converted, we can remove the PFN_PTE_SHIFT ifdef and we can define set_pte_at() unconditionally. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Matthew Wilcox (Oracle) <[email protected]> Reviewed-by: Anshuman Khandual <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-08-24 | mm: rationalise flush_icache_pages() and flush_icache_page() | Matthew Wilcox (Oracle) | 1 file, -0/+9
Move the default (no-op) implementation of flush_icache_pages() to <linux/cacheflush.h> from <asm-generic/cacheflush.h>. Remove the flush_icache_page() wrapper from each architecture and hoist it into <linux/cacheflush.h>. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Matthew Wilcox (Oracle) <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-08-24 | mm: remove page_mapping_file() | Matthew Wilcox (Oracle) | 1 file, -8/+0
This function has no more users. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Matthew Wilcox (Oracle) <[email protected]> Reviewed-by: Anshuman Khandual <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-08-24 | mm: add default definition of set_ptes() | Matthew Wilcox (Oracle) | 1 file, -21/+60
Most architectures can just define set_pte() and PFN_PTE_SHIFT to use this definition. It's also a handy spot to document the guarantees provided by the MM. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Matthew Wilcox (Oracle) <[email protected]> Suggested-by: Mike Rapoport (IBM) <[email protected]> Reviewed-by: Mike Rapoport (IBM) <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
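A hedged sketch of what such a generic definition can look like, built from the description above (set_pte() plus PFN_PTE_SHIFT); the real header carries more documentation and guards:

  #ifndef set_ptes
  /* Sketch: install nr contiguous PTEs, all within one PMD, advancing
   * the PFN by one page per entry via PFN_PTE_SHIFT. */
  static inline void set_ptes(struct mm_struct *mm, unsigned long addr,
                  pte_t *ptep, pte_t pte, unsigned int nr)
  {
          for (;;) {
                  set_pte(ptep, pte);
                  if (--nr == 0)
                          break;
                  ptep++;
                  pte = __pte(pte_val(pte) + (1UL << PFN_PTE_SHIFT));
          }
  }
  #define set_pte_at(mm, addr, ptep, pte) set_ptes(mm, addr, ptep, pte, 1)
  #endif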
2023-08-24 | mm: remove ARCH_IMPLEMENTS_FLUSH_DCACHE_FOLIO | Matthew Wilcox (Oracle) | 1 file, -2/+2
Current best practice is to reuse the name of the function as a define to indicate that the function is implemented by the architecture. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Matthew Wilcox (Oracle) <[email protected]> Acked-by: Mike Rapoport (IBM) <[email protected]> Reviewed-by: Anshuman Khandual <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
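The convention reads roughly as below (a sketch, not the exact headers): the architecture defines a macro with the function's own name, and generic code supplies a fallback only when the macro is absent.

  /* arch header (sketch): the architecture provides the real function */
  void flush_dcache_folio(struct folio *folio);
  #define flush_dcache_folio flush_dcache_folio

  /* generic header (sketch): no-op fallback when no arch implementation */
  #ifndef flush_dcache_folio
  static inline void flush_dcache_folio(struct folio *folio)
  {
  }
  #endif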
2023-08-24 | mm: add folio_flush_mapping() | Matthew Wilcox (Oracle) | 1 file, -5/+21
This is the folio equivalent of page_mapping_file(), but rename it to make it clear that it's very different from page_file_mapping(). Theoretically, there's nothing flush-only about it, but there are no other users today, and I doubt there will be; it's almost always more useful to know the swapfile's mapping or the swapcache's mapping. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Matthew Wilcox (Oracle) <[email protected]> Acked-by: Mike Rapoport (IBM) <[email protected]> Reviewed-by: Anshuman Khandual <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-08-24 | mm: convert page_table_check_pte_set() to page_table_check_ptes_set() | Matthew Wilcox (Oracle) | 1 file, -6/+7
Tell the page table check how many PTEs & PFNs we want it to check. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Matthew Wilcox (Oracle) <[email protected]> Reviewed-by: Mike Rapoport (IBM) <[email protected]> Acked-by: Pasha Tatashin <[email protected]> Reviewed-by: Anshuman Khandual <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-08-24 | minmax: add in_range() macro | Matthew Wilcox (Oracle) | 1 file, -0/+27
Patch series "New page table range API", v6.

This patchset changes the API used by the MM to set up page table entries. The four APIs are:

  set_ptes(mm, addr, ptep, pte, nr)
  update_mmu_cache_range(vma, addr, ptep, nr)
  flush_dcache_folio(folio)
  flush_icache_pages(vma, page, nr)

flush_dcache_folio() isn't technically new, but no architecture implemented it, so I've done that for them. The old APIs remain around but are mostly implemented by calling the new interfaces.

The new APIs are based around setting up N page table entries at once. The N entries belong to the same PMD, the same folio and the same VMA, so ptep++ is a legitimate operation, and locking is taken care of for you. Some architectures can do a better job of it than just a loop, but I have hesitated to make too deep a change to architectures I don't understand well.

One thing I have changed in every architecture is that PG_arch_1 is now a per-folio bit instead of a per-page bit when used for dcache clean/dirty tracking. This was something that would have to happen eventually, and it makes sense to do it now rather than iterate over every page involved in a cache flush and figure out if it needs to happen.

The point of all this is better performance, and Fengwei Yin has measured improvement on x86. I suspect you'll see improvement on your architecture too. Try the new will-it-scale test mentioned here: https://lore.kernel.org/linux-mm/[email protected]/ You'll need to run it on an XFS filesystem and have CONFIG_TRANSPARENT_HUGEPAGE set.

This patchset is the basis for much of the anonymous large folio work being done by Ryan, so it's received quite a lot of testing over the last few months.

This patch (of 38):

Determine if a value lies within a range more efficiently (subtraction + comparison vs two comparisons and an AND). It also has useful (under some circumstances) behaviour if the range exceeds the maximum value of the type. Convert all the conflicting definitions of in_range() within the kernel; some can use the generic definition while others need their own definition. Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Matthew Wilcox (Oracle) <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
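The subtraction trick relies on unsigned wraparound. A hedged sketch of the 64-bit case (the kernel macro also dispatches on operand width, which is omitted here):

  #include <linux/types.h>

  /* Sketch: is val in [start, start + len)? If val < start, the unsigned
   * subtraction wraps to a huge number and the comparison is false, so a
   * single subtract+compare replaces two compares and an AND. */
  static inline bool in_range64(u64 val, u64 start, u64 len)
  {
          return val - start < len;
  }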
2023-08-24 | mm: memcg: use rstat for non-hierarchical stats | Yosry Ahmed | 1 file, -3/+4
Currently, memcg uses rstat to maintain aggregated hierarchical stats. Counters are maintained for hierarchical stats at each memcg. Rstat tracks which cgroups have updates on which cpus to keep those counters fresh on the read-side.

Non-hierarchical stats are currently not covered by rstat. Their per-cpu counters are summed up on every read, which is expensive. The original implementation did the same. At some point before rstat, non-hierarchical aggregated counters were introduced by commit a983b5ebee57 ("mm: memcontrol: fix excessive complexity in memory.stat reporting"). However, those counters were updated on the performance critical write-side, which caused regressions, so they were later removed by commit 815744d75152 ("mm: memcontrol: don't batch updates of local VM stats and events"). See [1] for more detailed history.

Kernel versions in between a983b5ebee57 & 815744d75152 (a year and a half) enjoyed cheap reads of non-hierarchical stats, specifically on cgroup v1. When moving to more recent kernels, a performance regression for reading non-hierarchical stats is observed.

Now that we have rstat, we know exactly which percpu counters have updates for each stat. We can maintain non-hierarchical counters again, making reads much more efficient, without affecting the performance critical write-side. Hence, add non-hierarchical (i.e. local) counters for the stats, and extend rstat flushing to keep those up-to-date.

A caveat is that we now need a stats flush before reading local/non-hierarchical stats through {memcg/lruvec}_page_state_local() or memcg_events_local(), where we previously only needed a flush to read hierarchical stats. Most contexts reading non-hierarchical stats are already doing a flush; add a flush to the only missing context, in count_shadow_nodes().

With this patch, reading memory.stat from 1000 memcgs is 3x faster on a machine with 256 cpus on cgroup v1:

  # for i in $(seq 1000); do mkdir /sys/fs/cgroup/memory/cg$i; done
  # time cat /sys/fs/cgroup/memory/cg*/memory.stat > /dev/null
  real 0m0.125s
  user 0m0.005s
  sys  0m0.120s

After:

  real 0m0.032s
  user 0m0.005s
  sys  0m0.027s

To make sure there are no regressions on cgroup v2, I ran an artificial reclaim/refault stress test [2] that creates (NR_CPUS * 2) cgroups, assigns them limits, runs a worker process in each cgroup that allocates tmpfs memory equal to quadruple the limit (to invoke reclaim continuously), and then reads back the entire file (to invoke refaults). All workers are run in parallel, and zram is used as a swapping backend. Both reclaim and refault have conditional stats flushing. I ran this on a machine with 112 cpus, once on mm-unstable, and once on mm-unstable with this patch reverted.

(1) A few runs without this patch:

  # time ./stress_reclaim_refault.sh
  real 0m9.949s
  user 0m0.496s
  sys  14m44.974s

  # time ./stress_reclaim_refault.sh
  real 0m10.049s
  user 0m0.486s
  sys  14m55.791s

  # time ./stress_reclaim_refault.sh
  real 0m9.984s
  user 0m0.481s
  sys  14m53.841s

(2) A few runs with this patch:

  # time ./stress_reclaim_refault.sh
  real 0m9.885s
  user 0m0.486s
  sys  14m48.753s

  # time ./stress_reclaim_refault.sh
  real 0m9.903s
  user 0m0.495s
  sys  14m48.339s

  # time ./stress_reclaim_refault.sh
  real 0m9.861s
  user 0m0.507s
  sys  14m49.317s

No regressions are observed with this patch. There is actually a very slight improvement. If I have to guess, maybe it's because we avoid the percpu loop in count_shadow_nodes() when calling lruvec_page_state_local(), but I could not prove this using perf; it's probably in the noise.

[1] https://lore.kernel.org/lkml/[email protected]/
[2] https://lore.kernel.org/lkml/CAJD7tkb17x=qwoO37uxyYXLEUVp15BQKR+Xfh7Sg9Hx-wTQ_=w@mail.gmail.com/

Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Yosry Ahmed <[email protected]> Acked-by: Johannes Weiner <[email protected]> Acked-by: Roman Gushchin <[email protected]> Acked-by: Michal Hocko <[email protected]> Cc: Muchun Song <[email protected]> Cc: Shakeel Butt <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-08-24 | mm: handle userfaults under VMA lock | Suren Baghdasaryan | 1 file, -0/+20
Enable handle_userfault to operate under VMA lock by releasing VMA lock instead of mmap_lock and retrying. Note that FAULT_FLAG_RETRY_NOWAIT should never be used when handling faults under per-VMA lock protection because that would break the assumption that lock is dropped on retry. [[email protected]: fix a lockdep issue in vma_assert_write_locked] Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Suren Baghdasaryan <[email protected]> Acked-by: Peter Xu <[email protected]> Cc: Alistair Popple <[email protected]> Cc: Al Viro <[email protected]> Cc: Christian Brauner <[email protected]> Cc: Christoph Hellwig <[email protected]> Cc: David Hildenbrand <[email protected]> Cc: David Howells <[email protected]> Cc: Davidlohr Bueso <[email protected]> Cc: Hillf Danton <[email protected]> Cc: "Huang, Ying" <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Jan Kara <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Josef Bacik <[email protected]> Cc: Laurent Dufour <[email protected]> Cc: Liam R. Howlett <[email protected]> Cc: Lorenzo Stoakes <[email protected]> Cc: Matthew Wilcox <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Michel Lespinasse <[email protected]> Cc: Minchan Kim <[email protected]> Cc: Pavel Tatashin <[email protected]> Cc: Punit Agrawal <[email protected]> Cc: Vlastimil Babka <[email protected]> Cc: Yu Zhao <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-08-24 | mm: handle swap page faults under per-VMA lock | Suren Baghdasaryan | 1 file, -0/+13
When page fault is handled under per-VMA lock protection, all swap page faults are retried with mmap_lock because folio_lock_or_retry has to drop and reacquire mmap_lock if folio could not be immediately locked. Follow the same pattern as mmap_lock to drop per-VMA lock when waiting for folio and retrying once folio is available. With this obstacle removed, enable do_swap_page to operate under per-VMA lock protection. Drivers implementing ops->migrate_to_ram might still rely on mmap_lock, therefore we have to fall back to mmap_lock in that particular case. Note that the only time do_swap_page calls synchronous swap_readpage is when SWP_SYNCHRONOUS_IO is set, which is only set for QUEUE_FLAG_SYNCHRONOUS devices: brd, zram and nvdimms (both btt and pmem). Therefore we don't sleep in this path, and there's no need to drop the mmap or per-VMA lock. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Suren Baghdasaryan <[email protected]> Tested-by: Alistair Popple <[email protected]> Reviewed-by: Alistair Popple <[email protected]> Acked-by: Peter Xu <[email protected]> Cc: Al Viro <[email protected]> Cc: Christian Brauner <[email protected]> Cc: Christoph Hellwig <[email protected]> Cc: David Hildenbrand <[email protected]> Cc: David Howells <[email protected]> Cc: Davidlohr Bueso <[email protected]> Cc: Hillf Danton <[email protected]> Cc: "Huang, Ying" <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Jan Kara <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Josef Bacik <[email protected]> Cc: Laurent Dufour <[email protected]> Cc: Liam R. Howlett <[email protected]> Cc: Lorenzo Stoakes <[email protected]> Cc: Matthew Wilcox <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Michel Lespinasse <[email protected]> Cc: Minchan Kim <[email protected]> Cc: Pavel Tatashin <[email protected]> Cc: Punit Agrawal <[email protected]> Cc: Vlastimil Babka <[email protected]> Cc: Yu Zhao <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
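A hedged sketch of the ->migrate_to_ram fallback described above, roughly as a fault handler would express it (the surrounding do_swap_page() context is abbreviated):

  /* Drivers' migrate_to_ram can still rely on mmap_lock: give up the
   * per-VMA lock and let the fault be retried under mmap_lock. */
  if (vmf->flags & FAULT_FLAG_VMA_LOCK) {
          vma_end_read(vmf->vma);
          return VM_FAULT_RETRY;
  }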
2023-08-24 | mm: change folio_lock_or_retry to use vm_fault directly | Suren Baghdasaryan | 1 file, -5/+6
Change folio_lock_or_retry to accept vm_fault struct and return the vm_fault_t directly. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Suren Baghdasaryan <[email protected]> Suggested-by: Matthew Wilcox <[email protected]> Acked-by: Peter Xu <[email protected]> Cc: Alistair Popple <[email protected]> Cc: Al Viro <[email protected]> Cc: Christian Brauner <[email protected]> Cc: Christoph Hellwig <[email protected]> Cc: David Hildenbrand <[email protected]> Cc: David Howells <[email protected]> Cc: Davidlohr Bueso <[email protected]> Cc: Hillf Danton <[email protected]> Cc: "Huang, Ying" <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Jan Kara <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Josef Bacik <[email protected]> Cc: Laurent Dufour <[email protected]> Cc: Liam R. Howlett <[email protected]> Cc: Lorenzo Stoakes <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Michel Lespinasse <[email protected]> Cc: Minchan Kim <[email protected]> Cc: Pavel Tatashin <[email protected]> Cc: Punit Agrawal <[email protected]> Cc: Vlastimil Babka <[email protected]> Cc: Yu Zhao <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-08-24 | mm: add missing VM_FAULT_RESULT_TRACE name for VM_FAULT_COMPLETED | Suren Baghdasaryan | 1 file, -1/+2
VM_FAULT_RESULT_TRACE should contain an element for every vm_fault_reason to be used as flag_array inside trace_print_flags_seq(). The element for VM_FAULT_COMPLETED is missing, add it. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Suren Baghdasaryan <[email protected]> Reviewed-by: Peter Xu <[email protected]> Cc: Alistair Popple <[email protected]> Cc: Al Viro <[email protected]> Cc: Christian Brauner <[email protected]> Cc: Christoph Hellwig <[email protected]> Cc: David Hildenbrand <[email protected]> Cc: David Howells <[email protected]> Cc: Davidlohr Bueso <[email protected]> Cc: Hillf Danton <[email protected]> Cc: "Huang, Ying" <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Jan Kara <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Josef Bacik <[email protected]> Cc: Laurent Dufour <[email protected]> Cc: Liam R. Howlett <[email protected]> Cc: Lorenzo Stoakes <[email protected]> Cc: Matthew Wilcox <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Michel Lespinasse <[email protected]> Cc: Minchan Kim <[email protected]> Cc: Pavel Tatashin <[email protected]> Cc: Punit Agrawal <[email protected]> Cc: Vlastimil Babka <[email protected]> Cc: Yu Zhao <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
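A hedged sketch of the fix: the flag_array handed to trace_print_flags_seq() gains the missing entry (the existing entries are abbreviated here):

  #define VM_FAULT_RESULT_TRACE \
          { VM_FAULT_OOM,         "OOM" },        \
          { VM_FAULT_SIGBUS,      "SIGBUS" },     \
          /* ... existing entries ... */          \
          { VM_FAULT_RETRY,       "RETRY" },      \
          { VM_FAULT_COMPLETED,   "COMPLETED" }   /* previously missing */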
2023-08-24 | io_uring: move iopoll ctx fields around | Pavel Begunkov | 1 file, -14/+11
Move poll_multi_queue and iopoll_list to the submission cache line. It doesn't make much sense to keep them separate, and this is a better place for them in general. Signed-off-by: Pavel Begunkov <[email protected]> Link: https://lore.kernel.org/r/5b03cf7e6652e350e6e70a917eec72ba9f33b97b.1692916914.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <[email protected]>
2023-08-24 | io_uring: move multishot cqe cache in ctx | Pavel Begunkov | 1 file, -1/+2
We cache multishot CQEs before flushing them to the CQ in submit_state.cqe. It's a 16 entry cache totalling 256 bytes in the middle of the io_submit_state structure. Move it out of there, it should help with CPU caches for the submission state, and shouldn't affect cached CQEs. Signed-off-by: Pavel Begunkov <[email protected]> Link: https://lore.kernel.org/r/dbe1f39c043ee23da918836be44fcec252ce6711.1692916914.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <[email protected]>
2023-08-24 | io_uring: separate task_work/waiting cache line | Pavel Begunkov | 1 file, -7/+12
task_works are typically queued up from IRQ/softirq context, potentially by a random CPU, as in the networking case. Group the ctx fields that bounce like this into a separate cache line. We also move ->cq_timeouts there, because waiters have to read and check it. We can also conditionally hide ->cq_timeouts from the CQ wait path in the future, as it is not a particularly useful rudiment. Signed-off-by: Pavel Begunkov <[email protected]> Link: https://lore.kernel.org/r/b7f3fcb5b6b9cca0238778262c1fdb7ada6286b7.1692916914.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <[email protected]>
2023-08-24 | io_uring: banish non-hot data to end of io_ring_ctx | Pavel Begunkov | 1 file, -18/+19
Let's move all slow path, setup/init and similar fields to the end of io_ring_ctx; that makes later ctx reorganisation easier. That includes the page arrays used only on tear down, the CQ overflow list, the old provided buffer caches, and the poll hashes used by io-wq. Signed-off-by: Pavel Begunkov <[email protected]> Link: https://lore.kernel.org/r/fc471b63925a0bf90a34943c4d36163c523cfb43.1692916914.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <[email protected]>
2023-08-24 | io_uring: move non aligned field to the end | Pavel Begunkov | 1 file, -18/+18
Move non-cache-aligned fields down in io_ring_ctx. This shouldn't change anything, but makes further refactoring easier. Signed-off-by: Pavel Begunkov <[email protected]> Link: https://lore.kernel.org/r/518e95d7888e9d481b2c5968dcf3f23db9ea47a5.1692916914.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <[email protected]>
2023-08-24 | io_uring: compact SQ/CQ heads/tails | Pavel Begunkov | 1 file, -2/+2
Queue heads and tails are cache line aligned. That makes the sq and cq take 4 cache lines, or 5 lines if we include the rest of struct io_rings (e.g. sq_flags is frequently accessed). Since modern io_uring is mostly single threaded, it doesn't make much sense to spread them out like this; it wastes space and puts additional pressure on caches. Put them all into a single line. Signed-off-by: Pavel Begunkov <[email protected]> Link: https://lore.kernel.org/r/9c8deddf9a7ed32069235a530d1e117fb460bc4c.1692916914.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <[email protected]>
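Illustrative only (not the real struct io_rings): per-field ____cacheline_aligned_in_smp padding versus one shared line.

  #include <linux/cache.h>
  #include <linux/types.h>

  struct rings_spread {           /* four cache lines for 16 bytes of data */
          u32 sq_head ____cacheline_aligned_in_smp;
          u32 sq_tail ____cacheline_aligned_in_smp;
          u32 cq_head ____cacheline_aligned_in_smp;
          u32 cq_tail ____cacheline_aligned_in_smp;
  };

  struct rings_compact {          /* one cache line holds all four */
          u32 sq_head, sq_tail, cq_head, cq_tail;
  } ____cacheline_aligned_in_smp;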
2023-08-24 | io_uring: merge iopoll and normal completion paths | Pavel Begunkov | 1 file, -0/+1
io_do_iopoll() and io_submit_flush_completions() are pretty similar: both fill CQEs and then free a list of requests. Don't duplicate this; make iopoll use __io_submit_flush_completions(), which also helps with inlining and other optimisations. For that, we need to first find all completed iopoll requests, splice them from the iopoll list, and then pass them down. This adds one extra list traversal, which should be fine as requests will stay hot in cache. CQ locking is already conditional; introduce ->lockless_cq and skip locking for IOPOLL, as it's protected by ->uring_lock. We also add a wakeup optimisation for IOPOLL to __io_cq_unlock_post(), so it works just like io_cqring_ev_posted_iopoll(). Signed-off-by: Pavel Begunkov <[email protected]> Link: https://lore.kernel.org/r/3840473f5e8a960de35b77292026691880f6bdbc.1692916914.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <[email protected]>
2023-08-24 | io_uring: simplify big_cqe handling | Pavel Begunkov | 1 file, -10/+6
Don't keep the big_cqe bits of a req in a union with hash_node; find a separate space for them. It's a bit safer, and if we keep them always initialised, we can get rid of the ugly REQ_F_CQE32_INIT handling. Signed-off-by: Pavel Begunkov <[email protected]> Link: https://lore.kernel.org/r/447aa1b2968978c99e655ba88db536e903df0fe9.1692916914.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <[email protected]>
2023-08-24 | PCI: Remove unused function declarations | Yue Haibing | 1 file, -1/+0
The following declarations have never been implemented since the beginning of git history, so remove them:

  u8 acpiphp_get_attention_status(struct acpiphp_slot *slot);
  u8 cpci_get_latch_status(struct slot *slot);
  u8 cpci_get_adapter_status(struct slot *slot);
  int ibmphp_get_total_hp_slots(void);
  void ibmphp_free_ibm_slot(struct slot *);
  void pdev_enable_device(struct pci_dev *);

Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Yue Haibing <[email protected]> Signed-off-by: Bjorn Helgaas <[email protected]>
2023-08-24 | net: dsa: use capital "OR" for multiple licenses in SPDX | Krzysztof Kozlowski | 1 file, -1/+1
Documentation/process/license-rules.rst and checkpatch expect the SPDX identifier syntax for multiple licenses to use capital "OR". Correct it to keep a consistent format and avoid copy-paste issues. Signed-off-by: Krzysztof Kozlowski <[email protected]> Reviewed-by: Kurt Kanzenbach <[email protected]> Reviewed-by: Florian Fainelli <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jakub Kicinski <[email protected]>
2023-08-24 | Merge branch 'mlx5-next' of https://git.kernel.org/pub/scm/linux/kernel/git/mellanox/linux | Jakub Kicinski | 4 files, -1/+87
Leon Romanovsky says:

====================
mlx5 MACsec RoCEv2 support

From Patrisious:

This series extends the previously added MACsec offload support to cover RoCE traffic as well. In order to achieve that, we need to configure MACsec with offload between the two endpoints, like below:

  REMOTE_MAC=10:70:fd:43:71:c0

  * ip addr add 1.1.1.1/16 dev eth2
  * ip link set dev eth2 up
  * ip link add link eth2 macsec0 type macsec encrypt on
  * ip macsec offload macsec0 mac
  * ip macsec add macsec0 tx sa 0 pn 1 on key 00 dffafc8d7b9a43d5b9a3dfbbf6a30c16
  * ip macsec add macsec0 rx port 1 address $REMOTE_MAC
  * ip macsec add macsec0 rx port 1 address $REMOTE_MAC sa 0 pn 1 on key 01 ead3664f508eb06c40ac7104cdae4ce5
  * ip addr add 10.1.0.1/16 dev macsec0
  * ip link set dev macsec0 up

And in a similar manner on the other machine, noting that the key order would be reversed and the MAC address is that of the other machine.

RDMA traffic is separated through relevant GID entries, and in case of an IP ambiguity issue - meaning we have a physical GID and a MACsec GID with the same IP/GID - we disable the physical GID in order to force the user to only use the MACsec GID.

v0: https://lore.kernel.org/netdev/[email protected]/

* 'mlx5-next' of https://git.kernel.org/pub/scm/linux/kernel/git/mellanox/linux:
  RDMA/mlx5: Handles RoCE MACsec steering rules addition and deletion
  net/mlx5: Add RoCE MACsec steering infrastructure in core
  net/mlx5: Configure MACsec steering for ingress RoCEv2 traffic
  net/mlx5: Configure MACsec steering for egress RoCEv2 traffic
  IB/core: Reorder GID delete code for RoCE
  net/mlx5: Add MACsec priorities in RDMA namespaces
  RDMA/mlx5: Implement MACsec gid addition and deletion
  net/mlx5: Maintain fs_id xarray per MACsec device inside macsec steering
  net/mlx5: Remove netdevice from MACsec steering
  net/mlx5e: Move MACsec flow steering and statistics database from ethernet to core
  net/mlx5e: Rename MACsec flow steering functions/parameters to suit core naming style
  net/mlx5: Remove dependency of macsec flow steering on ethernet
  net/mlx5e: Move MACsec flow steering operations to be used as core library
  macsec: add functions to get macsec real netdevice and check offload
====================

Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jakub Kicinski <[email protected]>
2023-08-24 | PCI/VGA: Fix typos | Sui Jingfeng | 1 file, -2/+2
Fix typos, rewrap to fill 78 columns, convert to conventional multi-line style. [bhelgaas: squash and add more fixes] Link: https://lore.kernel.org/r/[email protected] Link: https://lore.kernel.org/r/[email protected] Link: https://lore.kernel.org/r/[email protected] Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Sui Jingfeng <[email protected]> Signed-off-by: Bjorn Helgaas <[email protected]>
2023-08-24 | Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net | Jakub Kicinski | 1 file, -1/+2
Cross-merge networking fixes after downstream PR.

Conflicts:

  include/net/inet_sock.h
    f866fbc842de ("ipv4: fix data-races around inet->inet_id")
    c274af224269 ("inet: introduce inet->inet_flags")
  https://lore.kernel.org/all/[email protected]/

Adjacent changes:

  drivers/net/bonding/bond_alb.c
    e74216b8def3 ("bonding: fix macvlan over alb bond support")
    f11e5bd159b0 ("bonding: support balance-alb with openvswitch")

  drivers/net/ethernet/broadcom/bgmac.c
    d6499f0b7c7c ("net: bgmac: Return PTR_ERR() for fixed_phy_register()")
    23a14488ea58 ("net: bgmac: Fix return value check for fixed_phy_register()")

  drivers/net/ethernet/broadcom/genet/bcmmii.c
    32bbe64a1386 ("net: bcmgenet: Fix return value check for fixed_phy_register()")
    acf50d1adbf4 ("net: bcmgenet: Return PTR_ERR() for fixed_phy_register()")

  net/sctp/socket.c
    f866fbc842de ("ipv4: fix data-races around inet->inet_id")
    b09bde5c3554 ("inet: move inet->mc_loop to inet->inet_frags")

Signed-off-by: Jakub Kicinski <[email protected]>
2023-08-24 | SUNRPC: Allow specification of TCP client connect timeout at setup | Trond Myklebust | 2 files, -0/+4
When we create a TCP transport, the connect timeout parameters are currently fixed to be 90s. This is problematic in the pNFS flexfiles case, where we may have multiple mirrors, and we would like to fail over quickly to the next mirror if a data server is down. This patch adds the ability to specify the connection parameters at RPC client creation time. Signed-off-by: Trond Myklebust <[email protected]> Signed-off-by: Anna Schumaker <[email protected]>
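A hedged sketch only: the field names connect_timeout/reconnect_timeout and their placement in rpc_create_args are assumptions based on the description, not quoted from the patch.

  #include <linux/sunrpc/clnt.h>

  struct rpc_create_args args = {
          .net               = net,
          .protocol          = XPRT_TRANSPORT_TCP,
          .address           = (struct sockaddr *)&addr,
          .addrsize          = sizeof(addr),
          /* assumed fields: shorter than the old fixed 90s, so a pNFS
           * flexfiles client can fail over quickly to the next mirror */
          .connect_timeout   = 5 * HZ,
          .reconnect_timeout = 5 * HZ,
  };
  struct rpc_clnt *clnt = rpc_create(&args);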
2023-08-24 | SUNRPC: clean up integer overflow check | Dan Carpenter | 1 file, -3/+1
This integer overflow check works as intended, but Clang and GCC warn about it when compiling with W=1:

  include/linux/sunrpc/xdr.h:539:17: error: comparison is always false due to limited range of data type [-Werror=type-limits]

Use size_mul() to prevent the integer overflow. It silences the warning and is cleaner as well. Reported-by: Dmitry Antipov <[email protected]> Closes: https://lore.kernel.org/all/[email protected]/ Signed-off-by: Dan Carpenter <[email protected]> Acked-by: Jeff Layton <[email protected]> Signed-off-by: Anna Schumaker <[email protected]>
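A hedged before/after sketch of the pattern; the surrounding xdr.h function is abbreviated. size_mul() comes from <linux/overflow.h> and saturates at SIZE_MAX, so an overflowing length simply fails to decode:

  #include <linux/overflow.h>

  /* before (sketch): open-coded check, provably always false for the
   * type of len on 64-bit, which is what triggers the W=1 warning */
  if (len > SIZE_MAX / sizeof(*p))
          return -EBADMSG;
  p = xdr_inline_decode(xdr, len * sizeof(*p));

  /* after (sketch): saturating multiply, no separate check needed */
  p = xdr_inline_decode(xdr, size_mul(len, sizeof(*p)));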
2023-08-24 | Merge tag 'icc-6.6-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/djakov/icc into char-misc-next | Greg Kroah-Hartman | 2 files, -3/+19
Georgi writes:

interconnect changes for 6.6

This pull request contains the interconnect changes for the 6.6-rc1 merge window, which is a mix of core and driver changes with the following highlights:

Core changes:
 - New generic test client driver that allows issuing bandwidth requests between endpoints via debugfs.
 - Annotate all structs with flexible array members with the __counted_by attribute.
 - Introduce new icc_bw_lock for cases where we need to serialize bandwidth aggregation and update, to decouple that from paths that require memory allocation.

Driver changes:
 - Move the Qualcomm SMD RPM bus-clocks from CCF to the interconnect framework, where they actually belong. This brings power management improvements and reduces the overhead and layering. These changes are in an immutable branch that is being pulled also into the qcom tree.
 - Fixes for QUP nodes on SM8250.
 - Enable sync_state and keepalive for QCM2290.
 - Enable sync_state for SM8450.
 - Improve enable_mask-based BCMs handling and fix some bugs.
 - Add compatible string for the OSM-L3 on SDM670.
 - Add compatible strings for SC7180, SM8250 and SM6350 bandwidth monitors.
 - Expand and retire the DEFINE_QNODE and DEFINE_QBCM macros, which have become ugly beasts with many different arguments.

Signed-off-by: Georgi Djakov <[email protected]>

* tag 'icc-6.6-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/djakov/icc: (64 commits)
  interconnect: Add debugfs test client
  interconnect: Reintroduce icc_get()
  debugfs: Add write support to debugfs_create_str()
  interconnect: qcom: icc-rpmh: Retire DEFINE_QBCM
  interconnect: qcom: sm8350: Retire DEFINE_QBCM
  interconnect: qcom: sm8250: Retire DEFINE_QBCM
  interconnect: qcom: sm8150: Retire DEFINE_QBCM
  interconnect: qcom: sm6350: Retire DEFINE_QBCM
  interconnect: qcom: sdx65: Retire DEFINE_QBCM
  interconnect: qcom: sdx55: Retire DEFINE_QBCM
  interconnect: qcom: sdm845: Retire DEFINE_QBCM
  interconnect: qcom: sdm670: Retire DEFINE_QBCM
  interconnect: qcom: sc7180: Retire DEFINE_QBCM
  interconnect: qcom: icc-rpmh: Retire DEFINE_QNODE
  interconnect: qcom: sm8350: Retire DEFINE_QNODE
  interconnect: qcom: sm8250: Retire DEFINE_QNODE
  interconnect: qcom: sm8150: Retire DEFINE_QNODE
  interconnect: qcom: sm6350: Retire DEFINE_QNODE
  interconnect: qcom: sdx65: Retire DEFINE_QNODE
  interconnect: qcom: sdx55: Retire DEFINE_QNODE
  ...
2023-08-24 | libceph: add CEPH_OSD_OP_ASSERT_VER support | Jeff Layton | 2 files, -1/+9
...and record the user_version in the reply in a new field in ceph_osd_request, so we can populate the assert_ver appropriately. Shuffle the fields a bit too so that the new field fits in an existing hole on x86_64. Signed-off-by: Jeff Layton <[email protected]> Reviewed-by: Xiubo Li <[email protected]> Reviewed-and-tested-by: Luís Henriques <[email protected]> Reviewed-by: Milind Changire <[email protected]> Signed-off-by: Ilya Dryomov <[email protected]>