path: root/include
Age | Commit message | Author | Files | Lines
2024-05-08bcachefs: Move BCACHEFS_STATFS_MAGIC value to UAPI magic.hPetr Vorel1-0/+1
Move BCACHEFS_STATFS_MAGIC value to UAPI <linux/magic.h> under BCACHEFS_SUPER_MAGIC definition (use common approach for name) and reuse the definition in bcachefs_format.h BCACHEFS_STATFS_MAGIC. There are other bcachefs magic definitions: BCACHE_MAGIC, BCHFS_MAGIC, which use UUID_INIT() and are used only in libbcachefs. Therefore move only BCACHEFS_STATFS_MAGIC value, which can be used outside of libbcachefs for f_type field in struct statfs in statfs() or fstatfs(). Suggested-by: Su Yue <[email protected]> Signed-off-by: Petr Vorel <[email protected]> Acked-by: Brian Foster <[email protected]> Signed-off-by: Kent Overstreet <[email protected]>
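For illustration, a minimal user-space sketch of the f_type check this move enables (hypothetical example program; it assumes a bcachefs filesystem is mounted at /mnt and a libc/uapi headers set that already contains the new definition):

  /* Hypothetical user-space check, for illustration only. */
  #include <stdio.h>
  #include <sys/vfs.h>
  #include <linux/magic.h>    /* provides BCACHEFS_SUPER_MAGIC after this change */

  int main(void)
  {
          struct statfs st;

          if (statfs("/mnt", &st) == 0 && st.f_type == BCACHEFS_SUPER_MAGIC)
                  printf("/mnt is a bcachefs filesystem\n");
          return 0;
  }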
2024-05-08closures: closure_sync_timeout()Kent Overstreet1-0/+12
Add a new variant of closure_sync() that takes a timeout: closure_sync_timeout(). Note that when this returns -ETIME the closure will still be waiting on something, i.e. it's not safe to return if you've got a stack-allocated closure. Signed-off-by: Kent Overstreet <[email protected]>
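A hedged usage sketch of the stack-allocation caveat above (the helper name and -ETIME convention are taken from the text; the surrounding async code is hypothetical):

  /* Sketch only: a closure that may outlive closure_sync_timeout() must
   * not live on the stack, so allocate it from the heap. */
  struct closure *cl = kzalloc(sizeof(*cl), GFP_KERNEL);

  closure_init(cl, NULL);
  start_async_op(cl);                     /* hypothetical async user */

  if (closure_sync_timeout(cl, HZ) == -ETIME) {
          /* Still being waited on: do not free it here; the completion
           * path must be responsible for dropping the last reference. */
  } else {
          kfree(cl);
  }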
2024-05-08PCI/CXL: Add 'cxl_bus' reset method for devices below CXL PortsDave Jiang1-1/+1
By default Secondary Bus Reset (SBR) is masked for CXL Ports (see CXL r3.1, sec 8.1.5.2). Add cxl_reset_bus_function() (method "cxl_bus") to set the "Unmask SBR" bit in the upstream CXL Port before performing the bus reset and restore the original value afterwards. This method allows the user to perform a bus reset on a CXL device without needing to set the "Unmask SBR" bit via a user tool. Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Dave Jiang <[email protected]> [bhelgaas: simplify commit log, invert condition to avoid negation] Signed-off-by: Bjorn Helgaas <[email protected]> Reviewed-by: Jonathan Cameron <[email protected]> Reviewed-by: Dan Williams <[email protected]>
2024-05-08PCI/CXL: Fail bus reset if upstream CXL Port has SBR maskedDave Jiang1-0/+5
Per CXL spec r3.1, sec 8.1.5.2, the Secondary Bus Reset (SBR) bit in the Bridge Control register of a CXL port has no effect unless the "Unmask SBR" bit is set. Return -ENOTTY if we attempt a bus reset on a device below a CXL Port where "Unmask SBR" is 0. Otherwise, the bus reset would appear to have succeeded even though setting the bridge SBR bit had no effect. Link: https://lore.kernel.org/linux-cxl/20240220203956.GA1502351@bhelgaas/ Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Dave Jiang <[email protected]> [bhelgaas: simplify commit log and comments] Signed-off-by: Bjorn Helgaas <[email protected]> Reviewed-by: Jonathan Cameron <[email protected]> Reviewed-by: Kuppuswamy Sathyanarayanan <[email protected]> Reviewed-by: Dan Williams <[email protected]>
2024-05-08Merge 6.9-rc7 into char-misc-testingGreg Kroah-Hartman25-121/+255
We need the char-misc changes in here as well. Signed-off-by: Greg Kroah-Hartman <[email protected]>
2024-05-08PCI: Lock upstream bridge for pci_reset_function()Dave Jiang2-0/+7
Fix a long-standing locking gap for missing pci_cfg_access_lock() while manipulating bridge reset registers and configuration during pci_reset_bus_function(). If there is an upstream bridge, lock it before locking the device itself. pci_dev_lock() calls pci_cfg_access_lock(), which blocks the writing of PCI config space by user space. Add lockdep assertion via pci_dev->cfg_access_lock to verify pci_dev->block_cfg_access is set. Co-developed-by: Dan Williams <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Dan Williams <[email protected]> Signed-off-by: Dave Jiang <[email protected]> [bhelgaas: commit log] Signed-off-by: Bjorn Helgaas <[email protected]>
2024-05-08PCI/CXL: Move CXL Vendor ID to pci_ids.hDave Jiang1-0/+2
Move PCI_DVSEC_VENDOR_ID_CXL in CXL private code to PCI_VENDOR_ID_CXL in pci_ids.h in order to be utilized in PCI subsystem. While the CXL Vendor ID (0x1e98) is not listed in the PCI SIG "Member Companies" database at https://pcisig.com/membership/member-companies, the SIG has confirmed that it is reserved by CXL. Link: https://lore.kernel.org/r/[email protected] Suggested-by: Bjorn Helgaas <[email protected]> Link: https://lore.kernel.org/linux-cxl/20240402172323.GA1818777@bhelgaas/ Signed-off-by: Dave Jiang <[email protected]> [bhelgaas: update commit log] Signed-off-by: Bjorn Helgaas <[email protected]> Reviewed-by: Kuppuswamy Sathyanarayanan <[email protected]> Reviewed-by: Dan Williams <[email protected]>
2024-05-08fs/coredump: Enable dynamic configuration of max file note sizeAllen Pais1-0/+2
Introduce the capability to dynamically configure the maximum file note size for ELF core dumps via sysctl.

Why is this being done? We have observed that during a crash when there are more than 65k mmaps in memory, the existing fixed limit on the size of the ELF notes section becomes a bottleneck. The notes section quickly reaches its capacity, leading to incomplete memory segment information in the resulting coredump. This truncation compromises the utility of the coredumps, as crucial information about the memory state at the time of the crash might be omitted.

This enhancement removes the previous static limit of 4MB, allowing system administrators to adjust the size based on system-specific requirements or constraints.

Eg:
  $ sysctl -a | grep core_file_note_size_limit
  kernel.core_file_note_size_limit = 4194304
  $ sysctl -n kernel.core_file_note_size_limit
  4194304
  $ echo 519304 > /proc/sys/kernel/core_file_note_size_limit
  $ sysctl -n kernel.core_file_note_size_limit
  519304

Attempting to write beyond the ceiling value of 16MB:
  $ echo 17194304 > /proc/sys/kernel/core_file_note_size_limit
  bash: echo: write error: Invalid argument

Signed-off-by: Vijay Nag <[email protected]> Signed-off-by: Allen Pais <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Kees Cook <[email protected]>
2024-05-08Merge branch 'topic/hda-config-pm-cleanup' into for-nextTakashi Iwai38-77/+212
Pull HD-audio CONFIG_PM cleanup. Signed-off-by: Takashi Iwai <[email protected]>
2024-05-08ALSA: hda: Add Intel BMG PCI ID and HDMI codec vidChaitanya Kumar Borah2-0/+2
Add HD Audio PCI ID and HDMI codec vendor ID for Intel Battlemage. Signed-off-by: Chaitanya Kumar Borah <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Takashi Iwai <[email protected]>
2024-05-08ALSA: hda: codec: Reduce CONFIG_PM dependenciesTakashi Iwai1-11/+0
CONFIG_PM is almost mandatory nowadays for real systems, but we have lots of CONFIG_PM dependent code in snd-hda-codec helper code. Let's reduce the dependencies of CONFIG_PM now. The only visible drawback would be a couple of superfluous trace entries for runtime PM, but we can live with that. Signed-off-by: Takashi Iwai <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2024-05-08Merge branch kvm-arm64/misc-6.10 into kvmarm-master/nextMarc Zyngier1-1/+1
* kvm-arm64/misc-6.10:
  : .
  : Misc fixes and updates targeting 6.10
  :
  : - Improve boot-time diagnostics when the sysreg tables
  :   are not correctly sorted
  :
  : - Allow FFA_MSG_SEND_DIRECT_REQ in the FFA proxy
  :
  : - Fix duplicate XNX field in the ID_AA64MMFR1_EL1
  :   writeable mask
  :
  : - Allocate PPIs and SGIs outside of the vcpu structure, allowing
  :   for smaller EL2 mapping and some flexibility in implementing
  :   more or less than 32 private IRQs.
  :
  : - Use bitmap_gather() instead of its open-coded equivalent
  :
  : - Make protected mode use hVHE if available
  :
  : - Purge stale mpidr_data if a vcpu is created after the MPIDR
  :   map has been created
  : .
  KVM: arm64: Destroy mpidr_data for 'late' vCPU creation
  KVM: arm64: Use hVHE in pKVM by default on CPUs with VHE support
  KVM: arm64: Fix hvhe/nvhe early alias parsing
  KVM: arm64: Convert kvm_mpidr_index() to bitmap_gather()
  KVM: arm64: vgic: Allocate private interrupts on demand
  KVM: arm64: Remove duplicated AA64MMFR1_EL1 XNX
  KVM: arm64: Remove FFA_MSG_SEND_DIRECT_REQ from the denylist
  KVM: arm64: Improve out-of-order sysreg table diagnostics

Signed-off-by: Marc Zyngier <[email protected]>
2024-05-08watchdog: allow nmi watchdog to use raw perf eventSong Liu1-0/+2
NMI watchdog permanently consumes one hardware counter per CPU on the system. For systems that use many hardware counters, this causes more aggressive time multiplexing of perf events. OTOH, some CPUs (mostly Intel) support the "ref-cycles" event, which is rarely used. Add kernel cmdline arg nmi_watchdog=rNNN to configure the watchdog to use a raw event. For example, on Intel CPUs, we can use "r300" to configure the watchdog to use the ref-cycles event. If the raw event does not work, fall back to using "cycles". [[email protected]: fix kerneldoc] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Song Liu <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: "Matthew Wilcox (Oracle)" <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-05-08kfifo: don't use "proxy" headersAndy Shevchenko1-2/+7
Update header inclusions to follow IWYU (Include What You Use) principle. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Andy Shevchenko <[email protected]> Cc: Alain Volmat <[email protected]> Cc: AngeloGioacchino Del Regno <[email protected]> Cc: Chen-Yu Tsai <[email protected]> Cc: Hans Verkuil <[email protected]> Cc: Jernej Skrabec <[email protected]> Cc: Matthias Brugger <[email protected]> Cc: Mauro Carvalho Chehab <[email protected]> Cc: Patrice Chotard <[email protected]> Cc: Rob Herring <[email protected]> Cc: Samuel Holland <[email protected]> Cc: Sean Wang <[email protected]> Cc: Sean Young <[email protected]> Cc: Stefani Seibold <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-05-08kexec: fix the unexpected kexec_dprintk() macroBaoquan He1-4/+2
Jiri reported that the current kexec_dprintk() always prints out the debugging message whenever kexec/kdump loading is triggered. That is not wanted. The debugging message is supposed to be printed out only when 'kexec -s -d' is specified for kexec/kdump loading.

After investigating, the reason is that the current kexec_dprintk() takes printk(KERN_INFO) or printk(KERN_DEBUG) depending on whether '-d' is specified. However, distros usually have a default log level like below:

  [~]# cat /proc/sys/kernel/printk
  7 4 1 7

So, even though '-d' is not specified, printk(KERN_DEBUG) also always prints out. I thought printk(KERN_DEBUG) is equal to pr_debug(); it's not. Fix it by changing to use pr_info() instead, which works as expected.

Link: https://lkml.kernel.org/r/[email protected] Fixes: cbc2fe9d9cb2 ("kexec_file: add kexec_file flag to control debug printing") Signed-off-by: Baoquan He <[email protected]> Reported-by: Jiri Slaby <[email protected]> Closes: https://lore.kernel.org/all/[email protected] Reviewed-by: Dave Young <[email protected]> Cc: <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
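A minimal sketch of the resulting macro shape described above (not the literal patch; it assumes kexec_file_dbg_print is the flag set when 'kexec -s -d' is used):

  #define kexec_dprintk(fmt, arg...)                              \
          do {                                                    \
                  if (kexec_file_dbg_print)                       \
                          pr_info(fmt, ##arg);                    \
          } while (0)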
2024-05-08cpumask: delete unused reset_cpu_possible_mask()Alexey Dobriyan1-5/+0
Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Alexey Dobriyan <[email protected]> Cc: Rasmus Villemoes <[email protected]> Cc: Yury Norov <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-05-08drm: deprecate driver dateJani Nikula1-1/+1
The driver date serves no useful purpose, because it's hardly ever updated. The information is misleading at best. As described in Documentation/gpu/drm-internals.rst: The driver date, formatted as YYYYMMDD, is meant to identify the date of the latest modification to the driver. However, as most drivers fail to update it, its value is mostly useless. The DRM core prints it to the kernel log at initialization time and passes it to userspace through the DRM_IOCTL_VERSION ioctl. Stop printing the driver date at init, and start returning the empty string "" as driver date through the DRM_IOCTL_VERSION ioctl. The driver date initialization in drivers and the struct drm_driver date member can be removed in follow-up. Reviewed-by: Hamza Mahfooz <[email protected]> Acked-by: Simon Ser <[email protected]> Reviewed-by: Javier Martinez Canillas <[email protected]> Link: https://patchwork.freedesktop.org/patch/msgid/[email protected] Signed-off-by: Jani Nikula <[email protected]>
2024-05-08net/ipv4: add tracepoint for icmp_sendPeilin He1-0/+67
Introduce a tracepoint for icmp_send, which can help users get more detailed information conveniently when ICMP abnormal events happen.

1. Giving a use-case example:
=============================
When an application experiences packet loss due to an unreachable UDP destination port, the kernel will send an exception message through the icmp_send function. By adding a trace point for icmp_send, developers or system administrators can obtain detailed information about the UDP packet loss, including the type, code, source address, destination address, source port, and destination port. This facilitates the troubleshooting of UDP packet loss issues, especially for network-service applications.

2. Operation Instructions:
==========================
Switch to the tracing directory.
  cd /sys/kernel/tracing
Filter for destination port unreachable.
  echo "type==3 && code==3" > events/icmp/icmp_send/filter
Enable the trace event.
  echo 1 > events/icmp/icmp_send/enable

3. Result View:
================
  udp_client_erro-11370 [002] ...s.12 124.728002: icmp_send: icmp_send: type=3, code=3. From 127.0.0.1:41895 to 127.0.0.1:6666 ulen=23 skbaddr=00000000589b167a

Signed-off-by: Peilin He <[email protected]> Signed-off-by: xu xin <[email protected]> Reviewed-by: Yunkai Zhang <[email protected]> Cc: Yang Yang <[email protected]> Cc: Liu Chun <[email protected]> Cc: Xuexin Jiang <[email protected]> Reviewed-by: Steven Rostedt (Google) <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2024-05-08net: dsa: add support switches global DSCP priority mappingOleksij Rempel1-0/+9
Some switches like Microchip KSZ variants do not support per-port DSCP priority configuration. Instead there is a global DSCP mapping table. To handle it, we will accept set/del requests on any of the user ports to make the global configuration, and update DCB app entries for all other ports. Signed-off-by: Oleksij Rempel <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2024-05-08net: add IEEE 802.1q specific helpersOleksij Rempel2-0/+133
The IEEE 802.1q specification provides recommendations and examples which can be used as good default values for different drivers. This patch implements the mapping examples documented in IEEE 802.1Q-2022, Annex I "I.3 Traffic type to traffic class mapping", and the IETF DSCP naming and DSCP-to-Traffic-Type mapping inspired by RFC 8325. These helpers will be used in follow-up patches for the dsa/microchip DCB implementation. Signed-off-by: Oleksij Rempel <[email protected]> Signed-off-by: David S. Miller <[email protected]>
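A hedged sketch of how a switch driver might consume such helpers; the header and helper names below are assumptions for illustration, not a confirmed API:

  #include <net/ieee8021q.h>      /* assumed header for the new helpers */

  /* Illustrative only: map a DSCP value to a traffic class for a port
   * with a given number of egress queues. Helper names are assumptions. */
  static int example_dscp_to_tc(u8 dscp, unsigned int num_queues)
  {
          int tt;

          /* DSCP -> IEEE 802.1Q traffic type (RFC 8325 inspired mapping) */
          tt = ietf_dscp_to_ieee8021q_tt(dscp);
          if (tt < 0)
                  return tt;

          /* Traffic type -> traffic class, per the Annex I "I.3" examples */
          return ieee8021q_tt_to_tc(tt, num_queues);
  }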
2024-05-08net: dsa: add support for DCB get/set apptrust configurationOleksij Rempel1-0/+4
Add DCB support to get/set trust configuration for different packet priority information sources. Some switches allow choosing different sources of packet priority classification. For example, on KSZ switches it is possible to configure VLAN PCP and/or DSCP sources. Signed-off-by: Oleksij Rempel <[email protected]> Reviewed-by: Florian Fainelli <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2024-05-08RDMA/efa: Support QP with unsolicited write w/ imm. receiveMichael Margolin1-0/+7
Add a new EFA flags attribute for QP creation, and support unsolicited write with immediate flag. QPs created with this flag set will not consume receive work requests for incoming RDMA write with immediate. Expose device capability bit for this feature support. Reviewed-by: Daniel Kranzdorf <[email protected]> Reviewed-by: Firas Jahjah <[email protected]> Signed-off-by: Michael Margolin <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Leon Romanovsky <[email protected]>
2024-05-08uapi: stddef.h: Provide UAPI macros for __counted_by_{le, be}Erick Archer1-0/+8
This commit can be considered an addition to commit ca7e324e8ad3 ("compiler_types: add Endianness-dependent __counted_by_{le,be}") [1]. In the commit referenced above, the __counted_by_{le,be}() attributes were defined based on the platform's endianness, with the goal that structures containing flexible arrays at the end, and the counters for them, can be annotated with these attributes. So, this commit only provides the UAPI macros for UAPI structs that will gain __counted_by_{le,be} annotations. It is the preparatory step for being able to use these attributes in UAPI. Link: https://lore.kernel.org/r/[email protected] Suggested-by: Sven Eckelmann <[email protected]> Signed-off-by: Erick Archer <[email protected]> Link: https://lore.kernel.org/r/AS8PR02MB72372E45071E8821C07236F78BE42@AS8PR02MB7237.eurprd02.prod.outlook.com Fixes: ca7e324e8ad3 ("compiler_types: add Endianness-dependent __counted_by_{le,be}") Signed-off-by: Kees Cook <[email protected]>
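For example, a UAPI structure whose trailing flexible array is counted by a little-endian member could then be annotated like this (hypothetical struct, shown only to illustrate the macro):

  #include <linux/stddef.h>
  #include <linux/types.h>

  /* Hypothetical UAPI struct, for illustration only. */
  struct example_tlv {
          __le32 type;
          __le16 num_entries;
          __le16 reserved;
          __le32 entries[] __counted_by_le(num_entries);
  };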
2024-05-08xsk: use generic DMA sync shortcut instead of a custom oneAlexander Lobakin2-15/+6
XSk infra's been using its own DMA sync shortcut to try avoiding redundant function calls. Now that there is a generic one, remove the custom implementation and rely on the generic helpers. xsk_buff_dma_sync_for_cpu() doesn't need the second argument anymore, remove it. Signed-off-by: Alexander Lobakin <[email protected]> Signed-off-by: Christoph Hellwig <[email protected]>
2024-05-07net: phy: marvell: add support for MV88E6250 family internal PHYsMatthias Schiffer1-0/+2
The embedded PHYs of the 88E6250 family switches are very basic - they do not even have an Extended Address / Page register. This adds support for the PHYs to the driver to set up PHY interrupts and retrieve error stats. To deal with PHYs without a page register, "simple" variants of all stat handling functions are introduced. The code should work with all 88E6250 family switches (6250/6220/6071/ 6070/6020). The PHY ID 0x01410db0 was read from a 88E6020, under the assumption that all switches of this family use the same ID. The spec only lists the prefix 0x01410c00 and leaves the last 10 bits as reserved, but that seems too unspecific to be useful, as it would cover several existing PHY IDs already supported by the driver; therefore, the ID read from the actual hardware is used. Signed-off-by: Matthias Schiffer <[email protected]> Reviewed-by: Andrew Lunn <[email protected]> Link: https://lore.kernel.org/r/0695f699cd942e6e06da9d30daeedfd47785bc01.1714643285.git.matthias.schiffer@ew.tq-group.com Signed-off-by: Jakub Kicinski <[email protected]>
2024-05-07clock, reset: microchip: move all mpfs reset code to the reset subsystemConor Dooley1-5/+5
Stephen and Philipp, while reviewing patches, said that all of the aux device creation and the register read/write code could be moved to the reset subsystem, leaving the clock driver with no implementations of reset_* functions at all. Move them. Suggested-by: Philipp Zabel <[email protected]> Suggested-by: Stephen Boyd <[email protected]> Signed-off-by: Conor Dooley <[email protected]> Link: https://lore.kernel.org/r/20240424-strangle-sharpener-34755c5e6e3e@spud Signed-off-by: Stephen Boyd <[email protected]>
2024-05-07btrfs: add tracepoints for extent map shrinker eventsFilipe Manana1-0/+99
Add some tracepoints for the extent map shrinker to help debug and analyse main events. These have proved useful during development of the shrinker. Reviewed-by: Josef Bacik <[email protected]> Signed-off-by: Filipe Manana <[email protected]> Reviewed-by: David Sterba <[email protected]> Signed-off-by: David Sterba <[email protected]>
2024-05-07btrfs: stop referencing btrfs_delayed_tree_ref directlyJosef Bacik1-1/+0
We only ever need to use this to get the level of the tree block ref, so use the btrfs_delayed_ref_owner() helper, which returns the level for the given reference. Reviewed-by: Filipe Manana <[email protected]> Signed-off-by: Josef Bacik <[email protected]> Reviewed-by: David Sterba <[email protected]> Signed-off-by: David Sterba <[email protected]>
2024-05-07btrfs: stop referencing btrfs_delayed_data_ref directlyJosef Bacik1-1/+0
Now that most of our elements are inside of btrfs_delayed_ref_node directly and we have helpers for the delayed_data_ref bits, go ahead and remove all direct usage of btrfs_delayed_data_ref and use the helpers where needed. Reviewed-by: Filipe Manana <[email protected]> Signed-off-by: Josef Bacik <[email protected]> Reviewed-by: David Sterba <[email protected]> Signed-off-by: David Sterba <[email protected]>
2024-05-07btrfs: move ->parent and ->ref_root into btrfs_delayed_ref_nodeJosef Bacik1-4/+4
These two members are shared by both the tree refs and data refs, so move them into btrfs_delayed_ref_node proper. This allows us to greatly simplify the comparison code, as the shared refs always only sort on parent, and the non shared refs always sort first on ref_root, and then only data refs sort on their specific fields. Reviewed-by: Filipe Manana <[email protected]> Signed-off-by: Josef Bacik <[email protected]> Reviewed-by: David Sterba <[email protected]> Signed-off-by: David Sterba <[email protected]>
2024-05-07btrfs: simplify delayed ref tracepointsJosef Bacik1-33/+21
Now that all of the delayed ref information is in the delayed ref node, drastically simplify the delayed ref tracepoints by simply passing in the btrfs_delayed_ref_node and populating the tracepoints with the values from the structure itself. Reviewed-by: Filipe Manana <[email protected]> Signed-off-by: Josef Bacik <[email protected]> Reviewed-by: David Sterba <[email protected]> Signed-off-by: David Sterba <[email protected]>
2024-05-07btrfs: remove not needed mod_start and mod_len from struct extent_mapFilipe Manana1-2/+1
The mod_start and mod_len fields of struct extent_map were introduced by commit 4e2f84e63dc1 ("Btrfs: improve fsync by filtering extents that we want") in order to avoid too low performance when fsyncing a file that keeps getting its extent maps merged, because it resulted in each fsync logging again csum ranges that were already merged before.

We don't need this anymore as extent maps in the list of modified extents are never merged with other extent maps, and once we log an extent map we remove it from the list of modified extent maps, so it's never logged twice.

So remove the mod_start and mod_len fields from struct extent_map and use instead the start and len fields when logging checksums in the fast fsync path. This also makes EXTENT_FLAG_FILLING unused, so remove it as well.

Running the reproducer from the commit mentioned before, with a larger number of extents and against a null block device, so that IO is fast and we can better see any impact from searching checksum items and logging them, gave the following results from dd:

Before this change:
  409600000 bytes (410 MB, 391 MiB) copied, 22.948 s, 17.8 MB/s

After this change:
  409600000 bytes (410 MB, 391 MiB) copied, 22.9997 s, 17.8 MB/s

So no changes in throughput. The test was done in a release kernel (non-debug, Debian's default kernel config) and its steps are the following:

  $ mkfs.btrfs -f /dev/nullb0
  $ mount /dev/sdb /mnt
  $ dd if=/dev/zero of=/mnt/foobar bs=4k count=100000 oflag=sync
  $ umount /mnt

This also reduces the size of struct extent_map from 128 bytes down to 112 bytes, so now we can have 36 extent maps per 4K page instead of 32.

Reviewed-by: Qu Wenruo <[email protected]> Signed-off-by: Filipe Manana <[email protected]> Signed-off-by: David Sterba <[email protected]>
2024-05-07kcsan, compiler_types: Introduce __data_racy type qualifierMarco Elver1-0/+7
Based on the discussion at [1], it would be helpful to mark certain variables as explicitly "data racy", which would result in KCSAN not reporting data races involving any accesses on such variables. To do that, introduce the __data_racy type qualifier:

  struct foo {
          ...
          int __data_racy bar;
          ...
  };

In KCSAN-kernels, __data_racy turns into volatile, which KCSAN already treats specially by considering such accesses "marked". In non-KCSAN kernels the type qualifier turns into a no-op. The difference in generated code between KCSAN-instrumented kernels and non-KCSAN kernels is already huge (inserted calls into the runtime for every memory access), so the extra generated code (if any) due to volatile for the few such __data_racy variables is unlikely to have a measurable impact on performance.

Link: https://lore.kernel.org/all/CAHk-=wi3iondeh_9V2g3Qz5oHTRjLsOpoy83hb58MVh=nRZe0A@mail.gmail.com/ [1] Suggested-by: Linus Torvalds <[email protected]> Signed-off-by: Marco Elver <[email protected]> Cc: Paul E. McKenney <[email protected]> Cc: Tetsuo Handa <[email protected]> Signed-off-by: Paul E. McKenney <[email protected]>
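Conceptually, the qualifier maps to something like the following (a sketch of the described semantics, assuming __SANITIZE_THREAD__ is defined for KCSAN-instrumented builds):

  #ifdef __SANITIZE_THREAD__
  # define __data_racy volatile   /* KCSAN already treats volatile accesses as "marked" */
  #else
  # define __data_racy            /* no-op in non-KCSAN kernels */
  #endif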
2024-05-07memcg: use proper type for mod_memcg_stateShakeel Butt1-6/+7
The memcg stats update functions can take an arbitrary integer, but the only input which makes sense is enum memcg_stat_item, and we don't want these functions to be called with an arbitrary integer. So replace the parameter type with enum memcg_stat_item; the compiler will then be able to warn if memcg stat update functions are called with an incorrect index value. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Shakeel Butt <[email protected]> Reviewed-by: T.J. Mercier <[email protected]> Reviewed-by: Roman Gushchin <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Muchun Song <[email protected]> Cc: Yosry Ahmed <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
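The change boils down to a prototype tightening along these lines (a sketch, assuming the other parameters of mod_memcg_state() stay as they are):

  /* Before: any integer index is silently accepted. */
  void mod_memcg_state(struct mem_cgroup *memcg, int idx, int val);

  /* After: only enum memcg_stat_item indices type-check cleanly, so the
   * compiler can warn about an incorrect index value. */
  void mod_memcg_state(struct mem_cgroup *memcg, enum memcg_stat_item idx, int val);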
2024-05-07memcg: dynamically allocate lruvec_statsShakeel Butt1-56/+6
To decouple the dependency of lruvec_stats on NR_VM_NODE_STAT_ITEMS, we need to dynamically allocate lruvec_stats in the mem_cgroup_per_node structure. Also move the definition of lruvec_stats_percpu and lruvec_stats and related functions to the memcontrol.c to facilitate later patches. No functional changes in the patch. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Shakeel Butt <[email protected]> Reviewed-by: Yosry Ahmed <[email protected]> Reviewed-by: T.J. Mercier <[email protected]> Reviewed-by: Roman Gushchin <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Muchun Song <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-05-07Merge tag 'kvm-riscv-6.10-1' of https://github.com/kvm-riscv/linux into HEADPaolo Bonzini5-18/+70
KVM/riscv changes for 6.10

- Support guest breakpoints using ebreak
- Introduce per-VCPU mp_state_lock and reset_cntx_lock
- Virtualize SBI PMU snapshot and counter overflow interrupts
- New selftests for SBI PMU and Guest ebreak
2024-05-08KVM: PPC: Fix documentation for ppc mmu capsJoel Stanley1-2/+2
The documentation mentions KVM_CAP_PPC_RADIX_MMU, but the defines in the kvm headers spell it KVM_CAP_PPC_MMU_RADIX. Similarly with KVM_CAP_PPC_MMU_HASH_V3. Fixes: c92701322711 ("KVM: PPC: Book3S HV: Add userspace interfaces for POWER9 MMU") Signed-off-by: Joel Stanley <[email protected]> Acked-by: Paul Mackerras <[email protected]> Signed-off-by: Michael Ellerman <[email protected]> Link: https://msgid.link/[email protected]
2024-05-07HID: do not assume HAT Switch logical max < 8Benjamin Tissoires1-3/+3
Turns out that the code can handle a greater range, but the data stored can not. This is problematic on the Raptor Mach 2 joystick, whose logical max is 239. The kernel interprets it as `-15` and thus ignores the Hat Switch handling. Link: https://gitlab.freedesktop.org/libevdev/udev-hid-bpf/-/issues/17 Link: https://lore.kernel.org/r/[email protected] Reviewed-by: Peter Hutterer <[email protected]> Signed-off-by: Benjamin Tissoires <[email protected]>
2024-05-07md: Revert "md: Fix overflow in is_mddev_idle"Li Nan1-1/+1
This reverts commit 3f9f231236ce7e48780d8a4f1f8cb9fae2df1e4e. Using 64bit for 'sync_io' is unnecessary from the gendisk side. This overflow will not cause any functional impact, except for a UBSAN warning. Solving this overflow requires introducing additional calculations and checks which are not necessary. So just keep using 32bit for 'sync_io'. Signed-off-by: Li Nan <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
2024-05-07block: add a blk_alloc_discard_bio helperChristoph Hellwig1-0/+3
Factor out a helper from __blkdev_issue_discard that chews off as much as possible from a discard range and allocates a bio for it. Signed-off-by: Christoph Hellwig <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
2024-05-07block: add a bio_chain_and_submit helperChristoph Hellwig1-0/+1
This is basically blk_next_bio just with the bio allocation moved to the caller to allow for more flexible bio handling in the caller. Signed-off-by: Christoph Hellwig <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
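Together with blk_alloc_discard_bio() above, the intended composition looks roughly like this (a hedged sketch; the exact signatures are assumptions based on the two descriptions):

  /* Sketch: chew off the discard range bio by bio, chaining and
   * submitting as we go; the caller then waits on the returned tail bio. */
  static struct bio *example_issue_discard(struct block_device *bdev,
                                           sector_t sector, sector_t nr_sects,
                                           gfp_t gfp_mask)
  {
          struct bio *bio = NULL, *new;

          while ((new = blk_alloc_discard_bio(bdev, &sector, &nr_sects,
                                              gfp_mask)))
                  bio = bio_chain_and_submit(bio, new);

          return bio;
  }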
2024-05-07bug: Improve commentThorsten Blum1-1/+1
Add parentheses to WARN_ON_ONCE() for consistency. Signed-off-by: Thorsten Blum <[email protected]> Signed-off-by: Arnd Bergmann <[email protected]>
2024-05-07ACPI/NUMA: Remove architecture dependent remainingsRobert Richter1-5/+0
With the removal of the Itanium architecture [1], the last architecture-dependent functions, acpi_numa_slit_init() and acpi_numa_memory_affinity_init(), were removed. Remove their remnants in the header files too and make them static. [1] commit cf8e8658100d ("arch: Remove Itanium (IA-64) architecture") Reviewed-by: Dan Williams <[email protected]> Reviewed-by: Jonathan Cameron <[email protected]> Reviewed-by: Alison Schofield <[email protected]> Signed-off-by: Robert Richter <[email protected]> Signed-off-by: Rafael J. Wysocki <[email protected]>
2024-05-07x86/numa: Fix SRAT lookup of CFMWS ranges with numa_fill_memblks()Robert Richter1-6/+1
For configurations that have the kconfig option NUMA_KEEP_MEMINFO disabled, numa_fill_memblks() only returns with NUMA_NO_MEMBLK (-1). SRAT lookup fails then because an existing SRAT memory range cannot be found for a CFMWS address range. This causes the addition of a duplicate numa_memblk with a different node id and a subsequent page fault and kernel crash during boot.

Fix this by making numa_fill_memblks() always available regardless of NUMA_KEEP_MEMINFO. As Dan suggested, the fix is implemented by removing numa_fill_memblks() from sparsemem.h and also using __weak for the function.

Note that the issue was initially introduced with [1]. But since phys_to_target_node() was originally used, which returned the valid node 0, an additional numa_memblk was not added. Though the node id was wrong too, and a message is then seen in the logs:

  kernel/numa.c: pr_info_once("Unknown target node for memory at 0x%llx, assuming node 0\n",

[1] commit fd49f99c1809 ("ACPI: NUMA: Add a node and memblk for each CFMWS not in SRAT")

Suggested-by: Dan Williams <[email protected]> Link: https://lore.kernel.org/all/[email protected]/ Fixes: 8f1004679987 ("ACPI/NUMA: Apply SRAT proximity domain to entire CFMWS window") Reviewed-by: Jonathan Cameron <[email protected]> Reviewed-by: Alison Schofield <[email protected]> Reviewed-by: Dan Williams <[email protected]> Signed-off-by: Robert Richter <[email protected]> Acked-by: Borislav Petkov (AMD) <[email protected]> Signed-off-by: Rafael J. Wysocki <[email protected]>
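The described fallback amounts to a weak default definition roughly like this (a sketch of the approach, not the literal patch):

  /* Weak default: configurations that keep meminfo (NUMA_KEEP_MEMINFO)
   * override this with the real implementation. */
  int __weak numa_fill_memblks(u64 start, u64 end)
  {
          return NUMA_NO_MEMBLK;
  }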
2024-05-07page_pool: don't use driver-set flags field directlyAlexander Lobakin1-3/+10
page_pool::p is driver-defined params, copied directly from the structure passed to page_pool_create(). The structure isn't meant to be modified by the Page Pool core code and this even might look confusing[0][1]. In order to be able to alter some flags, let's define our own, internal fields the same way as the already existing one (::has_init_callback). They are defined as bits in the driver-set params, leave them so here as well, to not waste byte-per-bit or so. Almost 30 bits are still free for future extensions. We could've defined only new flags here or only the ones we may need to alter, but checking some flags in one place while others in another doesn't sound convenient or intuitive. ::flags passed by the driver can now go to the "slow" PP params. Suggested-by: Jakub Kicinski <[email protected]> Link[0]: https://lore.kernel.org/netdev/[email protected] Suggested-by: Alexander Duyck <[email protected]> Link[1]: https://lore.kernel.org/netdev/CAKgT0UfZCGnWgOH96E4GV3ZP6LLbROHM7SHE8NKwq+exX+Gk_Q@mail.gmail.com Signed-off-by: Alexander Lobakin <[email protected]> Signed-off-by: Christoph Hellwig <[email protected]>
2024-05-07page_pool: make sure frag API fields don't span between cachelinesAlexander Lobakin1-1/+11
After commit 5027ec19f104 ("net: page_pool: split the page_pool_params into fast and slow") that made &page_pool contain only "hot" params at the start, cacheline boundary chops frag API fields group in the middle again. To not bother with this each time fast params get expanded or shrunk, let's just align them to `4 * sizeof(long)`, the closest upper pow-2 to their actual size (2 longs + 1 int). This ensures 16-byte alignment for the 32-bit architectures and 32-byte alignment for the 64-bit ones, excluding unnecessary false-sharing. ::page_state_hold_cnt is used quite intensively on hotpath no matter if frag API is used, so move it to the newly created hole in the first cacheline. Signed-off-by: Alexander Lobakin <[email protected]> Signed-off-by: Christoph Hellwig <[email protected]>
2024-05-07dma: avoid redundant calls for sync operationsAlexander Lobakin3-5/+64
Quite often, devices do not need dma_sync operations on x86_64 at least. Indeed, when dev_is_dma_coherent(dev) is true and dev_use_swiotlb(dev) is false, iommu_dma_sync_single_for_cpu() and friends do nothing. However, indirectly calling them when CONFIG_RETPOLINE=y consumes about 10% of cycles on a cpu receiving packets from softirq at ~100Gbit rate. Even if/when CONFIG_RETPOLINE is not set, there is a cost of about 3%. Add dev->need_dma_sync boolean and turn it off during the device initialization (dma_set_mask()) depending on the setup: dev_is_dma_coherent() for the direct DMA, !(sync_single_for_device || sync_single_for_cpu) or the new dma_map_ops flag, %DMA_F_CAN_SKIP_SYNC, advertised for non-NULL DMA ops. Then later, if/when swiotlb is used for the first time, the flag is reset back to on, from swiotlb_tbl_map_single(). On iavf, the UDP trafficgen with XDP_DROP in skb mode test shows +3-5% increase for direct DMA. Suggested-by: Christoph Hellwig <[email protected]> # direct DMA shortcut Co-developed-by: Eric Dumazet <[email protected]> Signed-off-by: Eric Dumazet <[email protected]> Signed-off-by: Alexander Lobakin <[email protected]> Signed-off-by: Christoph Hellwig <[email protected]>
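In rough pseudo-form, the initialization-time decision reads like this (field and flag names are taken from the text above; the exact placement and helpers are assumptions):

  /* Sketch of the described decision, not the literal patch. */
  const struct dma_map_ops *ops = get_dma_ops(dev);

  if (!ops)       /* direct DMA */
          dev->need_dma_sync = !dev_is_dma_coherent(dev);
  else
          dev->need_dma_sync = (ops->sync_single_for_device ||
                                ops->sync_single_for_cpu) &&
                               !(ops->flags & DMA_F_CAN_SKIP_SYNC);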
2024-05-07dma: compile-out DMA sync op calls when not usedAlexander Lobakin1-29/+33
Some platforms do have DMA, but DMA there is always direct and coherent. Currently, even on such platforms DMA sync operations are compiled and called. Add a new hidden Kconfig symbol, DMA_NEED_SYNC, and set it only when either sync operations are needed or there is DMA ops or swiotlb or DMA debug is enabled. Compile global dma_sync_*() and dma_need_sync() only when it's set, otherwise provide empty inline stubs. The change allows for future optimizations of DMA sync calls depending on runtime conditions. Signed-off-by: Alexander Lobakin <[email protected]> Signed-off-by: Christoph Hellwig <[email protected]>
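The compile-out pattern described above looks roughly like this for a couple of the affected entry points (a sketch; the real header covers more functions):

  #ifdef CONFIG_DMA_NEED_SYNC
  void dma_sync_single_for_cpu(struct device *dev, dma_addr_t addr,
                               size_t size, enum dma_data_direction dir);
  bool dma_need_sync(struct device *dev, dma_addr_t dma_addr);
  #else
  /* Empty inline stubs when no platform in the build ever needs syncs. */
  static inline void dma_sync_single_for_cpu(struct device *dev, dma_addr_t addr,
                                             size_t size, enum dma_data_direction dir)
  {
  }
  static inline bool dma_need_sync(struct device *dev, dma_addr_t dma_addr)
  {
          return false;
  }
  #endif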
2024-05-07iommu/dma: fix zeroing of bounce buffer padding used by untrusted devicesMichael Kelley1-0/+5
iommu_dma_map_page() allocates swiotlb memory as a bounce buffer when an untrusted device wants to map only part of the memory in an granule. The goal is to disallow the untrusted device having DMA access to unrelated kernel data that may be sharing the granule. To meet this goal, the bounce buffer itself is zeroed, and any additional swiotlb memory up to alloc_size after the bounce buffer end (i.e., "post-padding") is also zeroed. However, as of commit 901c7280ca0d ("Reinstate some of "swiotlb: rework "fix info leak with DMA_FROM_DEVICE"""), swiotlb_tbl_map_single() always initializes the contents of the bounce buffer to the original memory. Zeroing the bounce buffer is redundant and probably wrong per the discussion in that commit. Only the post-padding needs to be zeroed. Also, when the DMA min_align_mask is non-zero, the allocated bounce buffer space may not start on a granule boundary. The swiotlb memory from the granule boundary to the start of the allocated bounce buffer might belong to some unrelated bounce buffer. So as described in the "second issue" in [1], it can't be zeroed to protect against untrusted devices. But as of commit af133562d5af ("swiotlb: extend buffer pre-padding to alloc_align_mask if necessary"), swiotlb_tbl_map_single() allocates pre-padding slots when necessary to meet min_align_mask requirements, making it possible to zero the pre-padding area as well. Finally, iommu_dma_map_page() uses the swiotlb for untrusted devices and also for certain kmalloc() memory. Current code does the zeroing for both cases, but it is needed only for the untrusted device case. Fix all of this by updating iommu_dma_map_page() to zero both the pre-padding and post-padding areas, but not the actual bounce buffer. Do this only in the case where the bounce buffer is used because of an untrusted device. [1] https://lore.kernel.org/all/[email protected]/ Signed-off-by: Michael Kelley <[email protected]> Reviewed-by: Petr Tesarik <[email protected]> Signed-off-by: Christoph Hellwig <[email protected]>
2024-05-07swiotlb: remove alloc_size argument to swiotlb_tbl_map_single()Michael Kelley1-1/+1
Currently swiotlb_tbl_map_single() takes alloc_align_mask and alloc_size arguments to specify a swiotlb allocation that is larger than mapping_size. This larger allocation is used solely by iommu_dma_map_page() to handle untrusted devices that should not have DMA visibility to memory pages that are partially used for unrelated kernel data.

Having two arguments to specify the allocation is redundant. While alloc_align_mask naturally specifies the alignment of the starting address of the allocation, it can also implicitly specify the size by rounding up the mapping_size to that alignment.

Additionally, the current approach has an edge case bug. iommu_dma_map_page() already does the rounding up to compute the alloc_size argument. But swiotlb_tbl_map_single() then calculates the alignment offset based on the DMA min_align_mask, and adds that offset to alloc_size. If the offset is non-zero, the addition may result in a value that is larger than the max the swiotlb can allocate. If the rounding up is done _after_ the alignment offset is added to the mapping_size (and the original mapping_size conforms to the value returned by swiotlb_max_mapping_size), then the max that the swiotlb can allocate will not be exceeded.

In view of these issues, simplify the swiotlb_tbl_map_single() interface by removing the alloc_size argument. Most call sites pass the same value for mapping_size and alloc_size, and they pass alloc_align_mask as zero. Just remove the redundant argument from these callers, as they will see no functional change. For iommu_dma_map_page() also remove the alloc_size argument, and have swiotlb_tbl_map_single() compute the alloc_size by rounding up mapping_size after adding the offset based on min_align_mask. This has the side effect of fixing the edge case bug but with no other functional change.

Also add a sanity test on the alloc_align_mask. While IOMMU code currently ensures the granule is not larger than PAGE_SIZE, if that guarantee were to be removed in the future, the downstream effect on the swiotlb might go unnoticed until strange allocation failures occurred.

Tested on an ARM64 system with 16K page size and some kernel test-only hackery to allow modifying the DMA min_align_mask and the granule size that becomes the alloc_align_mask. Tested these combinations with a variety of original memory addresses and sizes, including those that reproduce the edge case bug:

 * 4K granule and 0 min_align_mask
 * 4K granule and 0xFFF min_align_mask (4K - 1)
 * 16K granule and 0xFFF min_align_mask
 * 64K granule and 0xFFF min_align_mask
 * 64K granule and 0x3FFF min_align_mask (16K - 1)

With the changes, all combinations pass.

Signed-off-by: Michael Kelley <[email protected]> Reviewed-by: Petr Tesarik <[email protected]> Signed-off-by: Christoph Hellwig <[email protected]>
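The size derivation described above is essentially the following arithmetic inside swiotlb_tbl_map_single() (an illustrative sketch, not the exact code):

  /* Round up only after the min_align_mask offset has been added, so the
   * result cannot exceed what swiotlb_max_mapping_size() promised. */
  unsigned int offset = orig_addr & dma_get_min_align_mask(dev);
  size_t alloc_size = ALIGN(mapping_size + offset, alloc_align_mask + 1);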