path: root/include/linux
2024-05-07  ACPI/NUMA: Remove architecture dependent remainings  (Robert Richter; 1 file, -5/+0)

With the removal of the Itanium architecture [1], the last
architecture-dependent functions, acpi_numa_slit_init() and
acpi_numa_memory_affinity_init(), were removed. Remove their remnants
from the header files too and make them static.

[1] commit cf8e8658100d ("arch: Remove Itanium (IA-64) architecture")

Reviewed-by: Dan Williams <[email protected]>
Reviewed-by: Jonathan Cameron <[email protected]>
Reviewed-by: Alison Schofield <[email protected]>
Signed-off-by: Robert Richter <[email protected]>
Signed-off-by: Rafael J. Wysocki <[email protected]>

2024-05-07  x86/numa: Fix SRAT lookup of CFMWS ranges with numa_fill_memblks()  (Robert Richter; 1 file, -6/+1)

For configurations that have the kconfig option NUMA_KEEP_MEMINFO
disabled, numa_fill_memblks() only returns with NUMA_NO_MEMBLK (-1).
The SRAT lookup then fails because an existing SRAT memory range
cannot be found for a CFMWS address range. This causes the addition
of a duplicate numa_memblk with a different node id, and a subsequent
page fault and kernel crash during boot.

Fix this by making numa_fill_memblks() always available regardless of
NUMA_KEEP_MEMINFO. As Dan suggested, the fix removes
numa_fill_memblks() from sparsemem.h and also marks the function
__weak.

Note that the issue was initially introduced with [1]. But since
phys_to_target_node() was originally used, which returned the valid
node 0, an additional numa_memblk was not added. The node id was
still wrong, though, and a message is seen in the logs:

 kernel/numa.c: pr_info_once("Unknown target node for memory at 0x%llx, assuming node 0\n",

[1] commit fd49f99c1809 ("ACPI: NUMA: Add a node and memblk for each
    CFMWS not in SRAT")

Suggested-by: Dan Williams <[email protected]>
Link: https://lore.kernel.org/all/[email protected]/
Fixes: 8f1004679987 ("ACPI/NUMA: Apply SRAT proximity domain to entire CFMWS window")
Reviewed-by: Jonathan Cameron <[email protected]>
Reviewed-by: Alison Schofield <[email protected]>
Reviewed-by: Dan Williams <[email protected]>
Signed-off-by: Robert Richter <[email protected]>
Acked-by: Borislav Petkov (AMD) <[email protected]>
Signed-off-by: Rafael J. Wysocki <[email protected]>

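For illustration, the __weak fallback pattern adopted here has roughly
this shape (a sketch only; the exact location of the default
definition is not shown in this log):

    /* Generic weak default: returns NUMA_NO_MEMBLK when nothing
     * provides a real implementation, so SRAT lookup callers see a
     * consistent "no memblk found" result instead of a build-dependent
     * stub. */
    int __weak numa_fill_memblks(u64 start, u64 end)
    {
            return NUMA_NO_MEMBLK;
    }
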
2024-05-07  dma: avoid redundant calls for sync operations  (Alexander Lobakin; 3 files, -5/+64)

Quite often, devices do not need dma_sync operations, at least on
x86_64. Indeed, when dev_is_dma_coherent(dev) is true and
dev_use_swiotlb(dev) is false, iommu_dma_sync_single_for_cpu() and
friends do nothing.

However, indirectly calling them when CONFIG_RETPOLINE=y consumes
about 10% of cycles on a CPU receiving packets from softirq at
~100Gbit rate. Even when CONFIG_RETPOLINE is not set, there is a cost
of about 3%.

Add a dev->need_dma_sync boolean and turn it off during device
initialization (dma_set_mask()) depending on the setup:
dev_is_dma_coherent() for direct DMA, !(sync_single_for_device ||
sync_single_for_cpu), or the new dma_map_ops flag,
%DMA_F_CAN_SKIP_SYNC, advertised for non-NULL DMA ops. Then later,
if/when swiotlb is used for the first time, the flag is reset back to
on, from swiotlb_tbl_map_single().

On iavf, the UDP trafficgen with XDP_DROP in skb mode test shows a
+3-5% increase for direct DMA.

Suggested-by: Christoph Hellwig <[email protected]> # direct DMA shortcut
Co-developed-by: Eric Dumazet <[email protected]>
Signed-off-by: Eric Dumazet <[email protected]>
Signed-off-by: Alexander Lobakin <[email protected]>
Signed-off-by: Christoph Hellwig <[email protected]>

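For context, dma_need_sync() already lets a driver cache this decision
itself; a minimal consumer-side sketch (the driver structure and
function names are hypothetical, the helpers are the real ones from
<linux/dma-mapping.h>):

    #include <linux/dma-mapping.h>

    struct my_rx_queue {                /* hypothetical driver struct */
            struct device *dev;
            bool needs_sync;            /* cached once at setup time */
    };

    static void my_rx_setup(struct my_rx_queue *q, dma_addr_t first_buf)
    {
            /* true only if this mapping may actually require sync ops */
            q->needs_sync = dma_need_sync(q->dev, first_buf);
    }

    static void my_rx_poll(struct my_rx_queue *q, dma_addr_t addr, size_t len)
    {
            if (q->needs_sync)
                    dma_sync_single_for_cpu(q->dev, addr, len,
                                            DMA_FROM_DEVICE);
            /* ... process the received buffer ... */
    }
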
2024-05-07  dma: compile-out DMA sync op calls when not used  (Alexander Lobakin; 1 file, -29/+33)

Some platforms do have DMA, but DMA there is always direct and
coherent. Currently, even on such platforms, DMA sync operations are
compiled and called.

Add a new hidden Kconfig symbol, DMA_NEED_SYNC, and set it only when
sync operations are needed, or there are DMA ops, or swiotlb or DMA
debug is enabled. Compile the global dma_sync_*() and dma_need_sync()
only when it's set, otherwise provide empty inline stubs.

The change allows for future optimizations of DMA sync calls depending
on runtime conditions.

Signed-off-by: Alexander Lobakin <[email protected]>
Signed-off-by: Christoph Hellwig <[email protected]>

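The stub pattern described here has roughly this shape (simplified
sketch; the real declarations live in <linux/dma-mapping.h> and go
through internal wrappers):

    #ifdef CONFIG_DMA_NEED_SYNC
    void dma_sync_single_for_cpu(struct device *dev, dma_addr_t addr,
                                 size_t size, enum dma_data_direction dir);
    bool dma_need_sync(struct device *dev, dma_addr_t dma_addr);
    #else
    static inline void dma_sync_single_for_cpu(struct device *dev,
                    dma_addr_t addr, size_t size, enum dma_data_direction dir)
    {
    }
    static inline bool dma_need_sync(struct device *dev, dma_addr_t dma_addr)
    {
            return false;   /* always direct and coherent on such builds */
    }
    #endif
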
2024-05-07  iommu/dma: fix zeroing of bounce buffer padding used by untrusted devices  (Michael Kelley; 1 file, -0/+5)

iommu_dma_map_page() allocates swiotlb memory as a bounce buffer when
an untrusted device wants to map only part of the memory in a granule.
The goal is to prevent the untrusted device from having DMA access to
unrelated kernel data that may be sharing the granule. To meet this
goal, the bounce buffer itself is zeroed, and any additional swiotlb
memory up to alloc_size after the bounce buffer end (i.e.,
"post-padding") is also zeroed.

However, as of commit 901c7280ca0d ("Reinstate some of "swiotlb:
rework "fix info leak with DMA_FROM_DEVICE"""),
swiotlb_tbl_map_single() always initializes the contents of the bounce
buffer to the original memory. Zeroing the bounce buffer is redundant
and probably wrong per the discussion in that commit. Only the
post-padding needs to be zeroed.

Also, when the DMA min_align_mask is non-zero, the allocated bounce
buffer space may not start on a granule boundary. The swiotlb memory
from the granule boundary to the start of the allocated bounce buffer
might belong to some unrelated bounce buffer. So, as described in the
"second issue" in [1], it can't be zeroed to protect against untrusted
devices. But as of commit af133562d5af ("swiotlb: extend buffer
pre-padding to alloc_align_mask if necessary"),
swiotlb_tbl_map_single() allocates pre-padding slots when necessary to
meet min_align_mask requirements, making it possible to zero the
pre-padding area as well.

Finally, iommu_dma_map_page() uses the swiotlb for untrusted devices
and also for certain kmalloc() memory. Current code does the zeroing
for both cases, but it is needed only for the untrusted device case.

Fix all of this by updating iommu_dma_map_page() to zero both the
pre-padding and post-padding areas, but not the actual bounce buffer.
Do this only in the case where the bounce buffer is used because of an
untrusted device.

[1] https://lore.kernel.org/all/[email protected]/

Signed-off-by: Michael Kelley <[email protected]>
Reviewed-by: Petr Tesarik <[email protected]>
Signed-off-by: Christoph Hellwig <[email protected]>

2024-05-07  swiotlb: remove alloc_size argument to swiotlb_tbl_map_single()  (Michael Kelley; 1 file, -1/+1)

Currently swiotlb_tbl_map_single() takes alloc_align_mask and
alloc_size arguments to specify an swiotlb allocation that is larger
than mapping_size. This larger allocation is used solely by
iommu_dma_map_page() to handle untrusted devices that should not have
DMA visibility to memory pages that are partially used for unrelated
kernel data.

Having two arguments to specify the allocation is redundant. While
alloc_align_mask naturally specifies the alignment of the starting
address of the allocation, it can also implicitly specify the size by
rounding up the mapping_size to that alignment.

Additionally, the current approach has an edge case bug.
iommu_dma_map_page() already does the rounding up to compute the
alloc_size argument. But swiotlb_tbl_map_single() then calculates the
alignment offset based on the DMA min_align_mask, and adds that offset
to alloc_size. If the offset is non-zero, the addition may result in a
value that is larger than the max the swiotlb can allocate. If the
rounding up is done _after_ the alignment offset is added to the
mapping_size (and the original mapping_size conforms to the value
returned by swiotlb_max_mapping_size), then the max that the swiotlb
can allocate will not be exceeded.

In view of these issues, simplify the swiotlb_tbl_map_single()
interface by removing the alloc_size argument. Most call sites pass
the same value for mapping_size and alloc_size, and they pass
alloc_align_mask as zero. Just remove the redundant argument from
these callers, as they will see no functional change. For
iommu_dma_map_page() also remove the alloc_size argument, and have
swiotlb_tbl_map_single() compute the alloc_size by rounding up
mapping_size after adding the offset based on min_align_mask. This
has the side effect of fixing the edge case bug but with no other
functional change.

Also add a sanity test on the alloc_align_mask. While IOMMU code
currently ensures the granule is not larger than PAGE_SIZE, if that
guarantee were to be removed in the future, the downstream effect on
the swiotlb might go unnoticed until strange allocation failures
occurred.

Tested on an ARM64 system with 16K page size and some kernel test-only
hackery to allow modifying the DMA min_align_mask and the granule size
that becomes the alloc_align_mask. Tested these combinations with a
variety of original memory addresses and sizes, including those that
reproduce the edge case bug:

 * 4K granule and 0 min_align_mask
 * 4K granule and 0xFFF min_align_mask (4K - 1)
 * 16K granule and 0xFFF min_align_mask
 * 64K granule and 0xFFF min_align_mask
 * 64K granule and 0x3FFF min_align_mask (16K - 1)

With the changes, all combinations pass.

Signed-off-by: Michael Kelley <[email protected]>
Reviewed-by: Petr Tesarik <[email protected]>
Signed-off-by: Christoph Hellwig <[email protected]>

2024-05-07  printk: cleanup deprecated uses of strncpy/strcpy  (Justin Stitt; 1 file, -1/+1)

Cleanup some deprecated uses of strncpy() and strcpy() [1]. There
don't seem to be any bugs with the current code, but the readability
of this code could benefit from a quick makeover while removing some
deprecated stuff as a benefit.

The most interesting replacement made in this patch involves
concatenating "ttyS" with a digit-led user-supplied string. Instead
of doing two distinct string copies with carefully managed offsets and
lengths, let's use the more robust and self-explanatory scnprintf().
scnprintf() will 1) respect the bounds of @buf, 2) NUL-terminate @buf,
and 3) do the concatenation. This allows us to drop the manual
NUL-byte assignment.

Also, since isdigit() is used about a dozen lines after the open-coded
version, we'll replace it for uniformity's sake.

All the strcpy() --> strscpy() replacements are trivial as the source
strings are literals and much smaller than the destination size. No
behavioral change here. Use the new 2-argument version of strscpy()
introduced in commit e6584c3964f2f ("string: Allow 2-argument
strscpy()"). However, to make this work fully (since the size must be
known at compile time), also update the extern-qualified declaration
to have the proper size information.

Link: https://www.kernel.org/doc/html/latest/process/deprecated.html#strncpy-on-nul-terminated-strings [1]
Link: https://github.com/KSPP/linux/issues/90 [2]
Link: https://manpages.debian.org/testing/linux-manual-4.8/strscpy.9.en.html [3]
Cc: [email protected]
Signed-off-by: Justin Stitt <[email protected]>
Reviewed-by: Petr Mladek <[email protected]>
Link: https://lore.kernel.org/r/20240429-strncpy-kernel-printk-printk-c-v1-1-4da7926d7b69@google.com
[[email protected]: Removed obsolete brackets and added empty lines.]
Signed-off-by: Petr Mladek <[email protected]>

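For illustration, the kind of replacement described reads roughly like
this (a sketch with a hypothetical buffer, not the exact printk code):

    #include <linux/ctype.h>
    #include <linux/kernel.h>
    #include <linux/string.h>

    static char name[32];   /* hypothetical destination buffer */

    static void set_console_name(const char *s)
    {
            if (isdigit(*s))
                    /* bounded, NUL-terminated "ttyS" + digits concat */
                    scnprintf(name, sizeof(name), "ttyS%s", s);
            else
                    strscpy(name, s);   /* 2-arg strscpy(): size inferred */
    }
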
2024-05-07  gpiolib: Discourage to use formatting strings in line names  (Andy Shevchenko; 1 file, -3/+1)

Currently the documentation for line names allows using %u inside the
alternative name. This has been broken in the character device
approach from day 1 and is in use solely in sysfs. The character
device interface has the line number as part of its address, so users
are better off relying on that. Hence, remove the misleading
documentation.

On top of that, there are no in-kernel users (out of 6, if I'm
correct) for such names, and moreover, if one existed it wouldn't help
in distinguishing lines with the same naming, as '%u' would also be in
them and we would get a warning in gpiochip_set_desc_names() for such
cases.

Signed-off-by: Andy Shevchenko <[email protected]>
Reviewed-by: Linus Walleij <[email protected]>
Reviewed-by: Kent Gibson <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Bartosz Golaszewski <[email protected]>

2024-05-07  Merge tag 'intel-gpio-v6.10-1' of git://git.kernel.org/pub/scm/linux/kernel/git/andy/linux-gpio-intel into gpio/for-next  (Bartosz Golaszewski; 1 file, -4/+4)

intel-gpio for v6.10-1

* New driver for vGPIO controller on Intel Granite Rapids-D
* Update ACPI GPIO library to unify the IRQ code path
* Better GPIO IRQ line labeling for ACPI
* Switched Intel SCH driver to use "mapped" I/O accessors

The following is an automated git shortlog grouped by driver:

Add Intel Granite Rapids-D vGPIO driver:
 - Add Intel Granite Rapids-D vGPIO driver

crystalcove:
 - Use -ENOTSUPP consistently

gpiolib:
 - acpi: Set label for IRQ only lines
 - acpi: Add fwnode name to the GPIO interrupt label
 - acpi: Pass con_id instead of property into acpi_dev_gpio_irq_get_by()
 - acpi: Move acpi_can_fallback_to_crs() out of __acpi_find_gpio()
 - acpi: Simplify error handling in __acpi_find_gpio()
 - acpi: Extract __acpi_find_gpio() helper
 - acpi: Check for errors first in acpi_find_gpio()
 - acpi: Remove never true check in acpi_get_gpiod_by_index()

sch:
 - Utilise temporary variable for struct device
 - Switch to memory mapped IO accessors

wcove:
 - Use -ENOTSUPP consistently

2024-05-07  regmap: Reorder fields in 'struct regmap_config' to save some memory  (Christophe JAILLET; 1 file, -31/+31)

On x86_64 and allmodconfig, this shrinks the size of 'struct
regmap_config' from 328 to 312 bytes. This is usually a win, because
this structure is used as a static global variable.

When moving the kerneldoc fields, I've tried to keep the layout as
consistent as possible, which is not really easy!

Before:
	/* size: 328, cachelines: 6, members: 55 */
	/* sum members: 296, holes: 6, sum holes: 25 */
	/* padding: 7 */
	/* last cacheline: 8 bytes */

After:
	/* size: 312, cachelines: 5, members: 55 */
	/* sum members: 296, holes: 5, sum holes: 16 */
	/* last cacheline: 56 bytes */

For the record, this structure is also widely used:

	$ git grep static.*regmap_config | wc -l
	1327

Signed-off-by: Christophe JAILLET <[email protected]>
Link: https://lore.kernel.org/r/5e039cd8fe415dd7ab3169948c08a5311db9fb9a.1715024007.git.christophe.jaillet@wanadoo.fr
Signed-off-by: Mark Brown <[email protected]>

2024-05-06  Reapply "drm/qxl: simplify qxl_fence_wait"  (Linus Torvalds; 1 file, -7/+0)

This reverts commit 07ed11afb68d94eadd4ffc082b97c2331307c5ea.

Stephen Rostedt reports:
 "I went to run my tests on my VMs and the tests hung on boot up.
  Unfortunately, the most I ever got out was:

  [   93.607888] Testing event system initcall: OK
  [   93.667730] Running tests on all trace events:
  [   93.669757] Testing all events: OK
  [   95.631064] ------------[ cut here ]------------
  Timed out after 60 seconds"

and further debugging points to a possible circular locking dependency
between the console_owner locking and the worker pool locking.

Reverting the commit allows Steve's VM to boot to completion again.

[ This may obviously result in the "[TTM] Buffer eviction failed"
  messages again, which was the reason for that original revert. But
  at this point this seems preferable to a non-booting system... ]

Reported-and-bisected-by: Steven Rostedt <[email protected]>
Link: https://lore.kernel.org/all/[email protected]/
Acked-by: Maxime Ripard <[email protected]>
Cc: Alex Constantino <[email protected]>
Cc: Maxime Ripard <[email protected]>
Cc: Timo Lindfors <[email protected]>
Cc: Dave Airlie <[email protected]>
Cc: Gerd Hoffmann <[email protected]>
Cc: Maarten Lankhorst <[email protected]>
Cc: Thomas Zimmermann <[email protected]>
Cc: Daniel Vetter <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

2024-05-06  Merge tag 'slab-for-6.9-rc7-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/vbabka/slab  (Linus Torvalds; 1 file, -2/+2)

Pull slab fixes from Vlastimil Babka:

 - Fix for cleanup infrastructure (Dan Carpenter)

   This makes the __free(kfree) cleanup hooks not crash on error
   pointers.

 - SLUB fix for freepointer checking (Nicolas Bouchinet)

   This fixes a recently introduced bug that manifests when
   init_on_free, CONFIG_SLAB_FREELIST_HARDENED and consistency checks
   (slub_debug=F) are all enabled, and results in false-positive
   freepointer corrupt reports for caches that store the freepointer
   outside of the object area.

* tag 'slab-for-6.9-rc7-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/vbabka/slab:
  mm/slab: make __free(kfree) accept error pointers
  mm/slub: avoid zeroing outside-object freepointer for single free

2024-05-06  printk: Change type of CONFIG_BASE_SMALL to bool  (Yoann Congal; 3 files, -4/+4)

CONFIG_BASE_SMALL is currently of type int but is only used as a
boolean. So, change its type to bool and adapt all usages:
CONFIG_BASE_SMALL == 0 becomes !IS_ENABLED(CONFIG_BASE_SMALL) and
CONFIG_BASE_SMALL != 0 becomes IS_ENABLED(CONFIG_BASE_SMALL).

Reviewed-by: Petr Mladek <[email protected]>
Reviewed-by: Greg Kroah-Hartman <[email protected]>
Reviewed-by: Masahiro Yamada <[email protected]>
Signed-off-by: Yoann Congal <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Petr Mladek <[email protected]>

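One affected usage pattern, shown before and after (this mirrors the
shape of PID_MAX_DEFAULT in include/linux/threads.h):

    /* Before: CONFIG_BASE_SMALL was 0 or 1, used as an arithmetic value. */
    #define PID_MAX_DEFAULT (CONFIG_BASE_SMALL ? 0x1000 : 0x8000)

    /* After: it is a bool symbol, so test it with IS_ENABLED(). */
    #define PID_MAX_DEFAULT (IS_ENABLED(CONFIG_BASE_SMALL) ? 0x1000 : 0x8000)
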
2024-05-06  SUNRPC: add a new svc_find_listener helper  (Jeff Layton; 1 file, -0/+2)

svc_find_listener() will return the transport instance pointer for the
endpoint accepting connections/peer traffic from the specified
transport class and matching sockaddr.

Signed-off-by: Jeff Layton <[email protected]>
Signed-off-by: Lorenzo Bianconi <[email protected]>
Signed-off-by: Chuck Lever <[email protected]>

2024-05-06  SUNRPC: introduce svc_xprt_create_from_sa utility routine  (Lorenzo Bianconi; 1 file, -0/+3)

Add the svc_xprt_create_from_sa() utility routine and refactor the
svc_xprt_create() codebase in order to introduce the capability to
create a svc port from a socket address.

Reviewed-by: Jeff Layton <[email protected]>
Tested-by: Jeff Layton <[email protected]>
Signed-off-by: Lorenzo Bianconi <[email protected]>
Signed-off-by: Chuck Lever <[email protected]>

2024-05-06  nfsd: trivial GET_DIR_DELEGATION support  (Jeff Layton; 1 file, -0/+6)

This adds basic infrastructure for handling GET_DIR_DELEGATION calls
from clients, including the decoders and encoders. For now, it always
just returns NFS4_OK + GDD4_UNAVAIL. Eventually clients may start
sending this operation, and it's better if we can return GDD4_UNAVAIL
instead of having to abort the whole compound.

Signed-off-by: Jeff Layton <[email protected]>
Signed-off-by: Chuck Lever <[email protected]>

2024-05-06  Merge back thermal control material for v6.10.  (Rafael J. Wysocki; 1 file, -107/+2)

2024-05-06  alpha: drop pre-EV56 support  (Arnd Bergmann; 2 files, -16/+4)

All EV4 machines are already gone, and the remaining EV5-based
machines all support the slightly more modern EV56 generation as well.
Debian only supports EV56 and later.

Drop both of these and build kernels optimized for EV56 and higher
when the "generic" option is selected, tuning for an out-of-order EV6
pipeline, the same as Debian userspace.

Since this was the only supported architecture without 8-bit and
16-bit stores, common kernel code no longer has to worry about
aligning struct members, and existing workarounds from the block and
tty layers can be removed. The alpha memory management code no longer
needs an abstraction for the differences between EV4 and EV5+.

Link: https://lists.debian.org/debian-alpha/2023/05/msg00009.html
Acked-by: Paul E. McKenney <[email protected]>
Acked-by: Matt Turner <[email protected]>
Signed-off-by: Arnd Bergmann <[email protected]>

2024-05-06  firewire: Annotate struct fw_iso_packet with __counted_by()  (Gustavo A. R. Silva; 1 file, -1/+2)

Prepare for the coming implementation by GCC and Clang of the
__counted_by attribute. Flexible array members annotated with
__counted_by can have their accesses bounds-checked at run-time via
CONFIG_UBSAN_BOUNDS (for array indexing) and CONFIG_FORTIFY_SOURCE
(for strcpy/memcpy-family functions).

Signed-off-by: Gustavo A. R. Silva <[email protected]>
Link: https://lore.kernel.org/r/ZgIrOuR3JI/jzqoH@neat
Signed-off-by: Takashi Sakamoto <[email protected]>

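As a generic illustration of the annotation (a hypothetical struct,
not the actual fw_iso_packet layout):

    struct demo_packet {
            u32 n_words;                        /* element count for words[] */
            u32 words[] __counted_by(n_words);  /* flexible array member */
    };

    /* With the annotation in place, UBSAN_BOUNDS and FORTIFY_SOURCE can
     * check accesses such as pkt->words[i] against pkt->n_words. */
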
2024-05-06  regulator: new API for voltage reference supplies  (Mark Brown; 22 files, -42/+168)

Merge series from David Lechner <[email protected]>:

In the IIO subsystem, we noticed a pattern in many drivers where we
need to get, enable and get the voltage of a supply that provides a
reference voltage. In these cases, we only need the voltage and not a
handle to the regulator. Another common pattern is for chips to have
an internal reference voltage that is used when an external reference
is not available. There are also a few drivers outside of IIO that do
the same.

So we would like to propose a new regulator consumer API to handle
these specific cases to avoid repeating the same boilerplate code in
multiple drivers.

As an example of how these functions are used, I have included a few
patches to consumer drivers. But to avoid a giant patch bomb, I have
omitted the iio/adc and iio/dac patches I have prepared from this
series. I will send those separately, but these will add 36 more
users of devm_regulator_get_enable_read_voltage() in addition to the 6
here. In total, this will eliminate nearly 1000 lines of similar code
and will simplify writing and reviewing new drivers in the future.

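A hedged sketch of how a consumer driver might use the new helper; the
supply name "vref" and the 2.5 V internal-reference fallback are
hypothetical:

    #include <linux/platform_device.h>
    #include <linux/regulator/consumer.h>

    static int demo_adc_probe(struct platform_device *pdev)
    {
            int uv;

            /* get + enable "vref" and read its voltage in one call */
            uv = devm_regulator_get_enable_read_voltage(&pdev->dev, "vref");
            if (uv == -ENODEV)
                    uv = 2500000;   /* assumed internal 2.5 V reference */
            else if (uv < 0)
                    return uv;      /* propagate any other error */

            /* ... use uv (microvolts) to scale conversions ... */
            return 0;
    }
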
2024-05-05  mm/pagemap: make trylock_page return bool  (Hao Ge; 1 file, -1/+1)

Make trylock_page() return bool to align with the return value of
folio_trylock(); this also matches its comment.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Hao Ge <[email protected]>
Cc: Matthew Wilcox (Oracle) <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

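Based on the description, the resulting wrapper plausibly looks like
this:

    static inline bool trylock_page(struct page *page)
    {
            return folio_trylock(page_folio(page));
    }
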
2024-05-05  mm/damon: add DAMOS filter type YOUNG  (SeongJae Park; 1 file, -0/+2)

Define yet another DAMOS filter type, YOUNG. Like the anon and memcg
types, the filter is applied to each page in the memory region and
checks whether the page has been accessed since the last check. Based
on the 'matching' parameter, the page is filtered out or in.

Note that this commit adds only the type definition. The
implementation should be made by DAMON operations sets. A commit for
the implementation on the 'paddr' DAMON operations set will follow.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: SeongJae Park <[email protected]>
Tested-by: Honggyu Kim <[email protected]>
Cc: Jonathan Corbet <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

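Such a type definition amounts to one new enum value; a sketch of the
shape (neighboring values shown for context; their exact set and
placement are my assumption):

    /* Sketch of the addition to enum damos_filter_type in
     * <linux/damon.h>. */
    enum damos_filter_type {
            DAMOS_FILTER_TYPE_ANON,
            DAMOS_FILTER_TYPE_MEMCG,
            DAMOS_FILTER_TYPE_YOUNG,    /* new: accessed since last check */
            DAMOS_FILTER_TYPE_ADDR,
            DAMOS_FILTER_TYPE_TARGET,
            NR_DAMOS_FILTER_TYPES,
    };
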
2024-05-05  mm: simplify thp_vma_allowable_order  (Matthew Wilcox; 1 file, -14/+15)

Combine the three boolean arguments into one flags argument for
readability.

Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: Kefeng Wang <[email protected]>
Cc: Ryan Roberts <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

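For illustration, folding three booleans into one flags word looks
roughly like this (flag names follow my reading of the change; treat
them as illustrative):

    /* One flags word replaces three bool parameters (smaps, in_pf,
     * enforce_sysfs) at every call site. */
    #define TVA_SMAPS           (1 << 0)    /* exposed via /proc/<pid>/smaps */
    #define TVA_IN_PF           (1 << 1)    /* called from a page fault */
    #define TVA_ENFORCE_SYSFS   (1 << 2)    /* obey sysfs THP settings */

    #define thp_vma_allowable_order(vma, vm_flags, tva_flags, order) \
            (!!thp_vma_allowable_orders(vma, vm_flags, tva_flags, BIT(order)))
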
2024-05-05  writeback: support retrieving per group debug writeback stats of bdi  (Kemeng Shi; 1 file, -0/+1)

Add /sys/kernel/debug/bdi/xxx/wb_stats to show per group writeback
stats of a bdi. The following domain hierarchy is tested:

	            global domain (320G)
	            /                \
	cgroup domain1 (10G)    cgroup domain2 (10G)
	            |                |
	bdi        wb1              wb2

	/* per wb writeback info of bdi is collected */
	cat wb_stats
	WbCgIno:                    1
	WbWriteback:                0 kB
	WbReclaimable:              0 kB
	WbDirtyThresh:              0 kB
	WbDirtied:                  0 kB
	WbWritten:                  0 kB
	WbWriteBandwidth:      102400 kBps
	b_dirty:                    0
	b_io:                       0
	b_more_io:                  0
	b_dirty_time:               0
	state:                      1

	WbCgIno:                 4091
	WbWriteback:             1792 kB
	WbReclaimable:         820512 kB
	WbDirtyThresh:        6004692 kB
	WbDirtied:            1820448 kB
	WbWritten:             999488 kB
	WbWriteBandwidth:      169020 kBps
	b_dirty:                    0
	b_io:                       0
	b_more_io:                  1
	b_dirty_time:               0
	state:                      5

	WbCgIno:                 4131
	WbWriteback:             1120 kB
	WbReclaimable:         820064 kB
	WbDirtyThresh:        6004728 kB
	WbDirtied:            1822688 kB
	WbWritten:            1002400 kB
	WbWriteBandwidth:      153520 kBps
	b_dirty:                    0
	b_io:                       0
	b_more_io:                  1
	b_dirty_time:               0
	state:                      5

[[email protected]: fix build problems]
Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Kemeng Shi <[email protected]>
Cc: Brian Foster <[email protected]>
Cc: David Howells <[email protected]>
Cc: David Sterba <[email protected]>
Cc: Jan Kara <[email protected]>
Cc: Mateusz Guzik <[email protected]>
Cc: Matthew Wilcox (Oracle) <[email protected]>
Cc: SeongJae Park <[email protected]>
Cc: Stephen Rothwell <[email protected]>
Cc: Tejun Heo <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

2024-05-05  mm: remove PageReferenced  (Matthew Wilcox (Oracle); 1 file, -3/+3)

All callers now use folio_*_referenced(), so we can remove the
PageReferenced family of functions.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

2024-05-05  mm: remove page_ref_sub_return()  (Matthew Wilcox (Oracle); 1 file, -8/+3)

With all callers converted to folios, we can act directly on
folio->_refcount.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

2024-05-05  mm: convert put_devmap_managed_page_refs() to put_devmap_managed_folio_refs()  (Matthew Wilcox (Oracle); 1 file, -6/+6)

All callers have a folio, so we can remove this use of
page_ref_sub_return().

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

2024-05-05  mm: remove put_devmap_managed_page()  (Matthew Wilcox (Oracle); 1 file, -6/+1)

It only has one caller; convert that caller to use
put_devmap_managed_page_refs() instead.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

2024-05-05  mm: remove page_cache_alloc()  (Matthew Wilcox (Oracle); 1 file, -5/+0)

Patch series "More folio compat code removal".

More code removal with bonus kernel-doc addition.

This patch (of 7):

All callers have now been converted to filemap_alloc_folio().

Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

2024-05-05  mm/memory-failure: pass the folio to collect_procs_ksm()  (Matthew Wilcox (Oracle); 1 file, -11/+3)

We've already calculated it, so pass it in instead of recalculating it
in collect_procs_ksm().

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
Reviewed-by: Jane Chu <[email protected]>
Reviewed-by: Miaohe Lin <[email protected]>
Cc: Dan Williams <[email protected]>
Cc: Miaohe Lin <[email protected]>
Cc: Oscar Salvador <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

2024-05-05  mm: convert hugetlb_page_mapping_lock_write to folio  (Matthew Wilcox (Oracle); 1 file, -3/+3)

The page is only used to get the mapping, so the folio will do just as
well. Both callers already have a folio available, so this saves a
call to compound_head().

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
Reviewed-by: Jane Chu <[email protected]>
Reviewed-by: Oscar Salvador <[email protected]>
Acked-by: Miaohe Lin <[email protected]>
Cc: Dan Williams <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

2024-05-05  mm/memory-failure: convert shake_page() to shake_folio()  (Matthew Wilcox (Oracle); 1 file, -1/+0)

Removes two calls to compound_head(). Move the prototype to
internal.h; we definitely don't want code outside mm using it.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
Reviewed-by: Jane Chu <[email protected]>
Acked-by: Miaohe Lin <[email protected]>
Cc: Dan Williams <[email protected]>
Cc: Miaohe Lin <[email protected]>
Cc: Oscar Salvador <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

2024-05-05  mm: return the address from page_mapped_in_vma()  (Matthew Wilcox (Oracle); 1 file, -1/+1)

The only user of this function calls page_address_in_vma() immediately
after page_mapped_in_vma() calculates it and uses it to return
true/false. Return the address instead, allowing memory-failure to
skip the call to page_address_in_vma().

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
Acked-by: Miaohe Lin <[email protected]>
Reviewed-by: Jane Chu <[email protected]>
Cc: Dan Williams <[email protected]>
Cc: Oscar Salvador <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

2024-05-05  xarray: don't use "proxy" headers  (Andy Shevchenko; 1 file, -1/+5)

Update header inclusions to follow the IWYU (Include What You Use)
principle.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Andy Shevchenko <[email protected]>
Cc: Matthew Wilcox (Oracle) <[email protected]>
Cc: Rasmus Villemoes <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

2024-05-05  xarray: use BITS_PER_LONGS()  (Andy Shevchenko; 1 file, -1/+1)

Patch series "xarray: Clean up xarray.h".

The main portion of this change is to get rid of kernel.h being
included in other globally available headers. This reduces the degree
of dependency hell. The first patch makes it possible to avoid math.h
being included, as bitops.h is implied by bitmap.h.

This patch (of 2):

Use BITS_PER_LONGS() instead of the open-coded variant.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Andy Shevchenko <[email protected]>
Cc: Matthew Wilcox (Oracle) <[email protected]>
Cc: Rasmus Villemoes <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

2024-05-05  memcg: simple cleanup of stats update functions  (Shakeel Butt; 1 file, -17/+0)

mod_memcg_lruvec_state() is never called from outside of memcontrol.c
and always with irqs disabled. So, replace it with the irq-disabled
version and add an assert that irqs are disabled in the caller.

Similarly, mod_objcg_state() is not called from outside of
memcontrol.c, so simply make it static and change its name to
__mod_objcg_state().

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Shakeel Butt <[email protected]>
Acked-by: Johannes Weiner <[email protected]>
Reviewed-by: T.J. Mercier <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Muchun Song <[email protected]>
Cc: Roman Gushchin <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

2024-05-05  mm/page-flags: make PageUptodate return bool  (Hao Ge; 1 file, -1/+1)

Make PageUptodate() return bool to align with the return value of
folio_test_uptodate().

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Hao Ge <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: Josef Bacik <[email protected]>
Cc: Matthew Wilcox (Oracle) <[email protected]>
Cc: Ruihan Li <[email protected]>
Cc: Vishal Moola (Oracle) <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

2024-05-05  mm/madvise: introduce clear_young_dirty_ptes() batch helper  (Lance Yang; 2 files, -30/+53)

Patch series "mm/madvise: enhance lazyfreeing with mTHP in
madvise_free", v10.

This patchset adds support for lazyfreeing multi-size THP (mTHP)
without needing to first split the large folio via split_folio().
However, we still need to split a large folio that is not fully mapped
within the target range.

If a large folio is locked or shared, or if we fail to split it, we
just leave it in place and advance to the next PTE in the range. But
note that the behavior is changed; previously, any failure of this
sort would cause the entire operation to give up. As large folios
become more common, sticking to the old way could result in wasted
opportunities.

Performance Testing
===================

On an Intel I5 CPU, lazyfreeing a 1GiB VMA backed by PTE-mapped folios
of the same size results in the following runtimes for
madvise(MADV_FREE) in seconds (shorter is better):

	Folio Size |    Old     |    New     | Change
	----------------------------------------------
	      4KiB |  0.590251  |  0.590259  |    0%
	     16KiB |  2.990447  |  0.185655  |  -94%
	     32KiB |  2.547831  |  0.104870  |  -95%
	     64KiB |  2.457796  |  0.052812  |  -97%
	    128KiB |  2.281034  |  0.032777  |  -99%
	    256KiB |  2.230387  |  0.017496  |  -99%
	    512KiB |  2.189106  |  0.010781  |  -99%
	   1024KiB |  2.183949  |  0.007753  |  -99%
	   2048KiB |  0.002799  |  0.002804  |    0%

This patch (of 4):

This commit introduces clear_young_dirty_ptes() to replace
mkold_ptes(). By doing so, we can use the same function for both use
cases (madvise_pageout and madvise_free), and it also provides the
flexibility to only clear the dirty flag in the future if needed.

Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Lance Yang <[email protected]>
Suggested-by: Ryan Roberts <[email protected]>
Acked-by: David Hildenbrand <[email protected]>
Reviewed-by: Ryan Roberts <[email protected]>
Cc: Barry Song <[email protected]>
Cc: Jeff Xie <[email protected]>
Cc: Kefeng Wang <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Minchan Kim <[email protected]>
Cc: Muchun Song <[email protected]>
Cc: Peter Xu <[email protected]>
Cc: Yang Shi <[email protected]>
Cc: Yin Fengwei <[email protected]>
Cc: Zach O'Keefe <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

2024-05-05  buffer: add kernel-doc for bforget() and __bforget()  (Matthew Wilcox (Oracle); 1 file, -0/+10)

Distinguish these functions from brelse() and __brelse().

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
Tested-by: Randy Dunlap <[email protected]>
Cc: Pankaj Raghav <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

2024-05-05  buffer: add kernel-doc for brelse() and __brelse()  (Matthew Wilcox (Oracle); 1 file, -0/+16)

Move the documentation for __brelse() to brelse(), format it as
kernel-doc and update it from talking about pages to folios.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
Tested-by: Randy Dunlap <[email protected]>
Cc: Pankaj Raghav <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

2024-05-05  buffer: fix __bread and __bread_gfp kernel-doc  (Matthew Wilcox (Oracle); 1 file, -9/+13)

The extra indentation confused the kernel-doc parser, so remove it.
Fix some other wording while I'm here, and advise the user they need
to call brelse() on this buffer.

__bread_gfp() isn't used directly by filesystems, but the other
wrappers for it don't have documentation, so document it accordingly.

Link: https://lkml.kernel.org/r/[email protected]
Co-developed-by: Pankaj Raghav <[email protected]>
Signed-off-by: Pankaj Raghav <[email protected]>
Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
Tested-by: Randy Dunlap <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

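For context, the documented usage pattern is (a sketch; bdev, block
and blocksize are placeholders):

    struct buffer_head *bh;

    bh = __bread(bdev, block, blocksize);   /* returns NULL on failure */
    if (bh) {
            /* ... read bh->b_data ... */
            brelse(bh);     /* drop the reference __bread() took */
    }
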
2024-05-05  mm: add per-order mTHP anon_swpout and anon_swpout_fallback counters  (Barry Song; 1 file, -0/+2)

This helps to display the fragmentation situation of the swapfile,
showing the proportion of large folios that did not have to be split.
So far, we only support non-split swapout for anon memory, with the
possibility of expanding to shmem in the future. So, we add the
"anon" prefix to the counter names.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Barry Song <[email protected]>
Reviewed-by: Ryan Roberts <[email protected]>
Acked-by: David Hildenbrand <[email protected]>
Cc: Chris Li <[email protected]>
Cc: Domenico Cerasuolo <[email protected]>
Cc: Kairui Song <[email protected]>
Cc: Matthew Wilcox (Oracle) <[email protected]>
Cc: Peter Xu <[email protected]>
Cc: Ryan Roberts <[email protected]>
Cc: Suren Baghdasaryan <[email protected]>
Cc: Yosry Ahmed <[email protected]>
Cc: Yu Zhao <[email protected]>
Cc: Jonathan Corbet <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

2024-05-05  mm: add per-order mTHP anon_fault_alloc and anon_fault_fallback counters  (Barry Song; 1 file, -0/+21)

Patch series "mm: add per-order mTHP alloc and swpout counters", v6.

The patchset introduces a framework to facilitate mTHP counters,
starting with the allocation and swap-out counters. Currently, only
five new nodes are appended to the stats directory for each mTHP size:

/sys/kernel/mm/transparent_hugepage/hugepages-<size>/stats
	anon_fault_alloc
	anon_fault_fallback
	anon_fault_fallback_charge
	anon_swpout
	anon_swpout_fallback

These nodes are crucial for us to monitor the fragmentation levels of
both the buddy system and the swap partitions. In the future, we may
consider adding additional nodes for further insights.

This patch (of 4):

Profiling a system blindly with mTHP has become challenging due to the
lack of visibility into its operations. Presenting the success rate
of mTHP allocations appears to be a pressing need. Recently, I've
been experiencing significant difficulty debugging performance
improvements and regressions without these figures. It's crucial for
us to understand the true effectiveness of mTHP in real-world
scenarios, especially in systems with fragmented memory.

This patch establishes the framework for per-order mTHP counters. It
begins by introducing the anon_fault_alloc and anon_fault_fallback
counters. Additionally, to maintain consistency with
thp_fault_fallback_charge in /proc/vmstat, this patch also tracks
anon_fault_fallback_charge when mem_cgroup_charge fails for mTHP.
Incorporating additional counters should now be straightforward as
well.

Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Barry Song <[email protected]>
Acked-by: David Hildenbrand <[email protected]>
Cc: Chris Li <[email protected]>
Cc: Domenico Cerasuolo <[email protected]>
Cc: Kairui Song <[email protected]>
Cc: Matthew Wilcox (Oracle) <[email protected]>
Cc: Peter Xu <[email protected]>
Cc: Ryan Roberts <[email protected]>
Cc: Suren Baghdasaryan <[email protected]>
Cc: Yosry Ahmed <[email protected]>
Cc: Yu Zhao <[email protected]>
Cc: Jonathan Corbet <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

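A hedged sketch of what such a per-order, per-CPU counter framework
can look like (all names here are illustrative, not the exact kernel
ones):

    #include <linux/percpu.h>
    #include <linux/pgtable.h>      /* PMD_ORDER */

    enum demo_mthp_stat_item {
            DEMO_ANON_FAULT_ALLOC,
            DEMO_ANON_FAULT_FALLBACK,
            __DEMO_MTHP_STAT_COUNT,
    };

    struct demo_mthp_stat {
            unsigned long stats[PMD_ORDER + 1][__DEMO_MTHP_STAT_COUNT];
    };

    static DEFINE_PER_CPU(struct demo_mthp_stat, demo_mthp_stats);

    /* bump one counter for a given allocation order, cheaply */
    static inline void demo_count_mthp_stat(int order,
                                            enum demo_mthp_stat_item item)
    {
            if (order <= 0 || order > PMD_ORDER)
                    return;
            this_cpu_inc(demo_mthp_stats.stats[order][item]);
    }
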
2024-05-05  mm/hugetlb: rename dissolve_free_huge_pages() to dissolve_free_hugetlb_folios()  (Sidhartha Kumar; 1 file, -2/+2)

dissolve_free_huge_pages() only uses folios internally; rename it to
dissolve_free_hugetlb_folios() and change the comments which reference
it.

[[email protected]: remove unneeded `extern']
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Sidhartha Kumar <[email protected]>
Reviewed-by: Vishal Moola (Oracle) <[email protected]>
Reviewed-by: Miaohe Lin <[email protected]>
Cc: Jane Chu <[email protected]>
Cc: Matthew Wilcox (Oracle) <[email protected]>
Cc: Muchun Song <[email protected]>
Cc: Oscar Salvador <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

2024-05-05  mm/hugetlb: convert dissolve_free_huge_pages() to folios  (Sidhartha Kumar; 1 file, -2/+2)

Allows us to rename dissolve_free_huge_page() to
dissolve_free_hugetlb_folio(). Convert one caller to pass in a folio
directly and use page_folio() to convert the caller in
mm/memory-failure.

[[email protected]: remove unneeded `extern']
Link: https://lkml.kernel.org/r/[email protected]
[[email protected]: v2]
Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Sidhartha Kumar <[email protected]>
Reviewed-by: Oscar Salvador <[email protected]>
Reviewed-by: Vishal Moola (Oracle) <[email protected]>
Reviewed-by: Miaohe Lin <[email protected]>
Cc: Jane Chu <[email protected]>
Cc: Matthew Wilcox (Oracle) <[email protected]>
Cc: Muchun Song <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

2024-05-05  mm: make folio_mapcount() return 0 for small typed folios  (David Hildenbrand; 1 file, -2/+12)

We already handle it properly for large folios. Let's also return "0"
for small typed folios, like page_mapcount() currently would.

Consequently, folio_mapcount() will never return negative values for
typed folios, but may return negative values for underflows.

[[email protected]: make folio_mapcount() slightly more efficient]
Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: David Hildenbrand <[email protected]>
Cc: Chris Zankel <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: John Paul Adrian Glaubitz <[email protected]>
Cc: Jonathan Corbet <[email protected]>
Cc: Matthew Wilcox (Oracle) <[email protected]>
Cc: Max Filippov <[email protected]>
Cc: Miaohe Lin <[email protected]>
Cc: Muchun Song <[email protected]>
Cc: Naoya Horiguchi <[email protected]>
Cc: Peter Xu <[email protected]>
Cc: Richard Chang <[email protected]>
Cc: Rich Felker <[email protected]>
Cc: Ryan Roberts <[email protected]>
Cc: Yang Shi <[email protected]>
Cc: Yin Fengwei <[email protected]>
Cc: Yoshinori Sato <[email protected]>
Cc: Zi Yan <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

2024-05-05  mm: improve folio_likely_mapped_shared() using the mapcount of large folios  (David Hildenbrand; 1 file, -2/+17)

We can now read the mapcount of large folios very efficiently. Use it
to improve our handling of partially-mappable folios, falling back to
making a guess only when the folio is not "obviously mapped shared".

We can now better detect partially-mappable folios where the first
page is not mapped as "mapped shared", reducing "false negatives";
but false negatives are still possible.

While at it, fix up a wrong comment (false positive vs. false
negative) for KSM folios.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: David Hildenbrand <[email protected]>
Reviewed-by: Yin Fengwei <[email protected]>
Cc: Chris Zankel <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: John Paul Adrian Glaubitz <[email protected]>
Cc: Jonathan Corbet <[email protected]>
Cc: Matthew Wilcox (Oracle) <[email protected]>
Cc: Max Filippov <[email protected]>
Cc: Miaohe Lin <[email protected]>
Cc: Muchun Song <[email protected]>
Cc: Naoya Horiguchi <[email protected]>
Cc: Peter Xu <[email protected]>
Cc: Richard Chang <[email protected]>
Cc: Rich Felker <[email protected]>
Cc: Ryan Roberts <[email protected]>
Cc: Yang Shi <[email protected]>
Cc: Yoshinori Sato <[email protected]>
Cc: Zi Yan <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

2024-05-05  mm: track mapcount of large folios in single value  (David Hildenbrand; 3 files, -26/+33)

Let's track the mapcount of large folios in a single value. The
mapcount of a large folio currently corresponds to the sum of the
entire mapcount and all page mapcounts. This sum is what we actually
want to know in folio_mapcount() and it is also sufficient for
implementing folio_mapped().

With PTE-mapped THP becoming more important and more widely used, we
want to avoid looping over all pages of a folio just to obtain the
mapcount of large folios. The comment "In the common case, avoid the
loop when no pages mapped by PTE" in folio_total_mapcount() no longer
holds for mTHP that are always mapped by PTE.

Further, we are planning on using folio_mapcount() more frequently,
and might even want to remove page mapcounts for large folios in some
kernel configs. Therefore, allow for reading the mapcount of large
folios efficiently and atomically without looping over any pages.
Maintain the mapcount also for hugetlb pages for simplicity. Use the
new mapcount to implement folio_mapcount() and folio_mapped(). Make
page_mapped() simply call folio_mapped(). We can now get rid of
folio_large_is_mapped().

_nr_pages_mapped is now only used in rmap code and for debugging
purposes. Keep folio_nr_pages_mapped() around, but document that its
use should be limited to rmap internals and debugging purposes.

This change implies one additional atomic add/sub whenever
mapping/unmapping (parts of) a large folio. As we now batch RMAP
operations for PTE-mapped THP during fork(), during unmap/zap, and
when PTE-remapping a PMD-mapped THP, and we adjust the large mapcount
for a PTE batch only once, the added overhead in the common case is
small. Only when unmapping individual pages of a large folio (e.g.,
during COW), the overhead might be bigger in comparison, but it's
essentially one additional atomic operation.

Note that before the new mapcount would overflow, our refcount would
already overflow: each mapping requires a folio reference. Extend the
documentation of folio_mapcount().

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: David Hildenbrand <[email protected]>
Reviewed-by: Yin Fengwei <[email protected]>
Cc: Chris Zankel <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: John Paul Adrian Glaubitz <[email protected]>
Cc: Jonathan Corbet <[email protected]>
Cc: Matthew Wilcox (Oracle) <[email protected]>
Cc: Max Filippov <[email protected]>
Cc: Miaohe Lin <[email protected]>
Cc: Muchun Song <[email protected]>
Cc: Naoya Horiguchi <[email protected]>
Cc: Peter Xu <[email protected]>
Cc: Richard Chang <[email protected]>
Cc: Rich Felker <[email protected]>
Cc: Ryan Roberts <[email protected]>
Cc: Yang Shi <[email protected]>
Cc: Yoshinori Sato <[email protected]>
Cc: Zi Yan <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

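To make the scheme concrete, here is a minimal sketch of a single,
biased atomic counter that can be read in O(1) (field and helper names
are illustrative, not the exact kernel ones):

    #include <linux/atomic.h>

    struct demo_folio {
            atomic_t large_mapcount;    /* -1 based, like _mapcount */
    };

    static inline int demo_folio_mapcount(struct demo_folio *folio)
    {
            return atomic_read(&folio->large_mapcount) + 1;
    }

    static inline bool demo_folio_mapped(struct demo_folio *folio)
    {
            return demo_folio_mapcount(folio) > 0;
    }

    /* rmap add/remove adjust the counter once per PTE batch of nr pages */
    static inline void demo_folio_add_mapped(struct demo_folio *folio, int nr)
    {
            atomic_add(nr, &folio->large_mapcount);
    }
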
2024-05-05  mm/rmap: add fast-path for small folios when adding/removing/duplicating  (David Hildenbrand; 1 file, -0/+13)

Let's add a fast-path for small folios to all relevant rmap functions.
Note that only RMAP_LEVEL_PTE applies.

This is a preparation for tracking the mapcount of large folios in a
single value.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: David Hildenbrand <[email protected]>
Reviewed-by: Yin Fengwei <[email protected]>
Cc: Chris Zankel <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: John Paul Adrian Glaubitz <[email protected]>
Cc: Jonathan Corbet <[email protected]>
Cc: Matthew Wilcox (Oracle) <[email protected]>
Cc: Max Filippov <[email protected]>
Cc: Miaohe Lin <[email protected]>
Cc: Muchun Song <[email protected]>
Cc: Naoya Horiguchi <[email protected]>
Cc: Peter Xu <[email protected]>
Cc: Richard Chang <[email protected]>
Cc: Rich Felker <[email protected]>
Cc: Ryan Roberts <[email protected]>
Cc: Yang Shi <[email protected]>
Cc: Yoshinori Sato <[email protected]>
Cc: Zi Yan <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

2024-05-05  mm/rmap: always inline anon/file rmap duplication of a single PTE  (David Hildenbrand; 1 file, -4/+13)

As we grow the code, the compiler might make stupid decisions and
unnecessarily degrade fork() performance. Let's make sure to always
inline functions that operate on a single PTE so the compiler will
always optimize out the loop and avoid a function call.

This is a preparation for maintaining a total mapcount for large
folios.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: David Hildenbrand <[email protected]>
Reviewed-by: Yin Fengwei <[email protected]>
Cc: Chris Zankel <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: John Paul Adrian Glaubitz <[email protected]>
Cc: Jonathan Corbet <[email protected]>
Cc: Matthew Wilcox (Oracle) <[email protected]>
Cc: Max Filippov <[email protected]>
Cc: Miaohe Lin <[email protected]>
Cc: Muchun Song <[email protected]>
Cc: Naoya Horiguchi <[email protected]>
Cc: Peter Xu <[email protected]>
Cc: Richard Chang <[email protected]>
Cc: Rich Felker <[email protected]>
Cc: Ryan Roberts <[email protected]>
Cc: Yang Shi <[email protected]>
Cc: Yoshinori Sato <[email protected]>
Cc: Zi Yan <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
