aboutsummaryrefslogtreecommitdiff
AgeCommit message (Collapse)AuthorFilesLines
2021-09-08mm: in_irq() cleanupChangbin Du2-2/+2
Replace the obsolete and ambiguos macro in_irq() with new macro in_hardirq(). Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Changbin Du <[email protected]> Acked-by: Catalin Marinas <[email protected]> [kmemleak] Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2021-09-08highmem: don't disable preemption on RT in kmap_atomic()Sebastian Andrzej Siewior1-5/+22
kmap_atomic() disables preemption and pagefaults for historical reasons. The conversion to kmap_local(), which only disables migration, cannot be done wholesale because quite some call sites need to be updated to accommodate with the changed semantics. On PREEMPT_RT enabled kernels the kmap_atomic() semantics are problematic due to the implicit disabling of preemption which makes it impossible to acquire 'sleeping' spinlocks within the kmap atomic sections. PREEMPT_RT replaces the preempt_disable() with a migrate_disable() for more than a decade. It could be argued that this is a justification to do this unconditionally, but PREEMPT_RT covers only a limited number of architectures and it disables some functionality which limits the coverage further. Limit the replacement to PREEMPT_RT for now. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Sebastian Andrzej Siewior <[email protected]> Acked-by: Vlastimil Babka <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Peter Zijlstra <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2021-09-08mm/early_ioremap.c: remove redundant early_ioremap_shutdown()Weizhao Ouyang2-11/+0
early_ioremap_reset() reserved a weak function so that architectures can provide a specific cleanup. Now no architectures use it, remove this redundant function. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Weizhao Ouyang <[email protected]> Reviewed-by: David Hildenbrand <[email protected]> Cc: Arnd Bergmann <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2021-09-08mm: don't allow executable ioremap mappingsChristoph Hellwig1-1/+1
There is no need to execute from iomem (and most platforms it is impossible anyway), so add the pgprot_nx() call similar to vmap. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Christoph Hellwig <[email protected]> Cc: Nicholas Piggin <[email protected]> Cc: Peter Zijlstra <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2021-09-08mm: move ioremap_page_range to vmalloc.cChristoph Hellwig4-34/+19
Patch series "small ioremap cleanups". The first patch moves a little code around the vmalloc/ioremap boundary following a bigger move by Nick earlier. The second enforces non-executable mapping on ioremap just like we do for vmap. No driver currently uses executable mappings anyway, as they should. This patch (of 2): This keeps it together with the implementation, and to remove the vmap_range wrapper. Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Christoph Hellwig <[email protected]> Reviewed-by: Nicholas Piggin <[email protected]> Cc: Peter Zijlstra <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2021-09-08riscv: only select GENERIC_IOREMAP if MMU support is enabledChristoph Hellwig1-1/+1
nommu ioremap is an inline stub in asm-generic/io.h. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Christoph Hellwig <[email protected]> Cc: Nicholas Piggin <[email protected]> Cc: Peter Zijlstra <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2021-09-08mm: remove redundant compound_head() callingMuchun Song2-6/+7
There is a READ_ONCE() in the macro of compound_head(), which will prevent compiler from optimizing the code when there are more than once calling of it in a function. Remove the redundant calling of compound_head() from page_to_index() and page_add_file_rmap() for better code generation. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Muchun Song <[email protected]> Reviewed-by: David Howells <[email protected]> Cc: Matthew Wilcox (Oracle) <[email protected]> Cc: William Kucharski <[email protected]> Cc: Kirill A. Shutemov <[email protected]> Cc: Johannes Weiner <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2021-09-08mm/memory_hotplug: use helper zone_is_zone_device() to simplify the codeMiaohe Lin1-3/+1
Patch series "Cleanup and fixups for memory hotplug". This series contains cleanup to use helper function to simplify the code. Also we fix some potential bugs. More details can be found in the respective changelogs. This patch (of 3): Use helper zone_is_zone_device() to simplify the code and remove some explicit CONFIG_ZONE_DEVICE codes. Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Miaohe Lin <[email protected]> Reviewed-by: David Hildenbrand <[email protected]> Reviewed-by: Oscar Salvador <[email protected]> Reviewed-by: Naoya Horiguchi <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Minchan Kim <[email protected]> Cc: Chris Goldsworthy <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2021-09-08mm/memory_hotplug: improved dynamic memory group aware "auto-movable" online ↵David Hildenbrand3-4/+89
policy Currently, the "auto-movable" online policy does not allow for hotplugged KERNEL (ZONE_NORMAL) memory to increase the amount of MOVABLE memory we can have, primarily, because there is no coordiantion across memory devices and we don't want to create zone-imbalances accidentially when unplugging memory. However, within a single memory device it's different. Let's allow for KERNEL memory within a dynamic memory group to allow for more MOVABLE within the same memory group. The only thing we have to take care of is that the managing driver avoids zone imbalances by unplugging MOVABLE memory first, otherwise there can be corner cases where unplug of memory could result in (accidential) zone imbalances. virtio-mem is the only user of dynamic memory groups and recently added support for prioritizing unplug of ZONE_MOVABLE over ZONE_NORMAL, so we don't need a new toggle to enable it for dynamic memory groups. We limit this handling to dynamic memory groups, because: * We want to keep the runtime overhead for collecting stats when onlining a single memory block small. We tend to have only a handful of dynamic memory groups, but we can have quite some static memory groups (e.g., 256 DIMMs). * It doesn't make too much sense for static memory groups, as we try onlining all applicable memory blocks either completely to ZONE_MOVABLE or not. In ordinary operation, we won't have a mixture of zones within a static memory group. When adding memory to a dynamic memory group, we'll first online memory to ZONE_MOVABLE as long as early KERNEL memory allows for it. Then, we'll online the next unit(s) to ZONE_NORMAL, until we can online the next unit(s) to ZONE_MOVABLE. For a simple virtio-mem device with a MOVABLE:KERNEL ratio of 3:1, it will result in a layout like: [M][M][M][M][M][M][M][M][N][M][M][M][N][M][M][M]... ^ movable memory due to early kernel memory ^ allows for more movable memory ... ^-----^ ... here ^ allows for more movable memory ... ^-----^ ... here While the created layout is sub-optimal when it comes to contiguous zones, it gives us the maximum flexibility when dynamically growing/shrinking a device; we can grow small VMs really big in small steps, and still shrink reliably to e.g., 1/4 of the maximum VM size in this example, removing full memory blocks along with meta data more reliably. Mark dynamic memory groups in the xarray such that we can efficiently iterate over them when collecting stats. In usual setups, we have one virtio-mem device per NUMA node, and usually only a small number of NUMA nodes. Note: for now, there seems to be no compelling reason to make this behavior configurable. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: David Hildenbrand <[email protected]> Cc: Anshuman Khandual <[email protected]> Cc: Dan Williams <[email protected]> Cc: Dave Hansen <[email protected]> Cc: Greg Kroah-Hartman <[email protected]> Cc: Hui Zhu <[email protected]> Cc: Jason Wang <[email protected]> Cc: Len Brown <[email protected]> Cc: Marek Kedzierski <[email protected]> Cc: "Michael S. Tsirkin" <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Mike Rapoport <[email protected]> Cc: Oscar Salvador <[email protected]> Cc: Pankaj Gupta <[email protected]> Cc: Pavel Tatashin <[email protected]> Cc: Rafael J. Wysocki <[email protected]> Cc: "Rafael J. Wysocki" <[email protected]> Cc: Vitaly Kuznetsov <[email protected]> Cc: Vlastimil Babka <[email protected]> Cc: Wei Yang <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2021-09-08mm/memory_hotplug: memory group aware "auto-movable" online policyDavid Hildenbrand3-12/+57
Use memory groups to improve our "auto-movable" onlining policy: 1. For static memory groups (e.g., a DIMM), online a memory block MOVABLE only if all other memory blocks in the group are either MOVABLE or could be onlined MOVABLE. A DIMM will either be MOVABLE or not, not a mixture. 2. For dynamic memory groups (e.g., a virtio-mem device), online a memory block MOVABLE only if all other memory blocks inside the current unit are either MOVABLE or could be onlined MOVABLE. For a virtio-mem device with a device block size with 512 MiB, all 128 MiB memory blocks wihin a 512 MiB unit will either be MOVABLE or not, not a mixture. We have to pass the memory group to zone_for_pfn_range() to take the memory group into account. Note: for now, there seems to be no compelling reason to make this behavior configurable. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: David Hildenbrand <[email protected]> Cc: Anshuman Khandual <[email protected]> Cc: Dan Williams <[email protected]> Cc: Dave Hansen <[email protected]> Cc: Greg Kroah-Hartman <[email protected]> Cc: Hui Zhu <[email protected]> Cc: Jason Wang <[email protected]> Cc: Len Brown <[email protected]> Cc: Marek Kedzierski <[email protected]> Cc: "Michael S. Tsirkin" <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Mike Rapoport <[email protected]> Cc: Oscar Salvador <[email protected]> Cc: Pankaj Gupta <[email protected]> Cc: Pavel Tatashin <[email protected]> Cc: Rafael J. Wysocki <[email protected]> Cc: "Rafael J. Wysocki" <[email protected]> Cc: Vitaly Kuznetsov <[email protected]> Cc: Vlastimil Babka <[email protected]> Cc: Wei Yang <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2021-09-08virtio-mem: use a single dynamic memory group for a single virtio-mem deviceDavid Hildenbrand1-3/+19
Let's use a single dynamic memory group. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: David Hildenbrand <[email protected]> Cc: Anshuman Khandual <[email protected]> Cc: Dan Williams <[email protected]> Cc: Dave Hansen <[email protected]> Cc: Greg Kroah-Hartman <[email protected]> Cc: Hui Zhu <[email protected]> Cc: Jason Wang <[email protected]> Cc: Len Brown <[email protected]> Cc: Marek Kedzierski <[email protected]> Cc: "Michael S. Tsirkin" <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Mike Rapoport <[email protected]> Cc: Oscar Salvador <[email protected]> Cc: Pankaj Gupta <[email protected]> Cc: Pavel Tatashin <[email protected]> Cc: Rafael J. Wysocki <[email protected]> Cc: "Rafael J. Wysocki" <[email protected]> Cc: Vitaly Kuznetsov <[email protected]> Cc: Vlastimil Babka <[email protected]> Cc: Wei Yang <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2021-09-08dax/kmem: use a single static memory group for a single probed unitDavid Hildenbrand1-8/+32
Although dax/kmem users often disable auto-onlining and instead online memory manually (usually to ZONE_MOVABLE), there is still value in having auto-onlining be aware of the relationship of memory blocks. Let's treat one probed unit as a single static memory device, similar to a single ACPI memory device. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: David Hildenbrand <[email protected]> Cc: Anshuman Khandual <[email protected]> Cc: Dan Williams <[email protected]> Cc: Dave Hansen <[email protected]> Cc: Greg Kroah-Hartman <[email protected]> Cc: Hui Zhu <[email protected]> Cc: Jason Wang <[email protected]> Cc: Len Brown <[email protected]> Cc: Marek Kedzierski <[email protected]> Cc: "Michael S. Tsirkin" <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Mike Rapoport <[email protected]> Cc: Oscar Salvador <[email protected]> Cc: Pankaj Gupta <[email protected]> Cc: Pavel Tatashin <[email protected]> Cc: Rafael J. Wysocki <[email protected]> Cc: "Rafael J. Wysocki" <[email protected]> Cc: Vitaly Kuznetsov <[email protected]> Cc: Vlastimil Babka <[email protected]> Cc: Wei Yang <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2021-09-08ACPI: memhotplug: use a single static memory group for a single memory deviceDavid Hildenbrand1-5/+30
Let's group all memory we add for a single memory device - we want a single node for that (which also seems to be the sane thing to do). We won't care for now about memory that was already added to the system (e.g., via e820) -- usually *all* memory of a memory device was already added and we'll fail acpi_memory_enable_device(). Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: David Hildenbrand <[email protected]> Acked-by: Rafael J. Wysocki <[email protected]> Cc: Anshuman Khandual <[email protected]> Cc: Dan Williams <[email protected]> Cc: Dave Hansen <[email protected]> Cc: Greg Kroah-Hartman <[email protected]> Cc: Hui Zhu <[email protected]> Cc: Jason Wang <[email protected]> Cc: Len Brown <[email protected]> Cc: Marek Kedzierski <[email protected]> Cc: "Michael S. Tsirkin" <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Mike Rapoport <[email protected]> Cc: Oscar Salvador <[email protected]> Cc: Pankaj Gupta <[email protected]> Cc: Pavel Tatashin <[email protected]> Cc: "Rafael J. Wysocki" <[email protected]> Cc: Vitaly Kuznetsov <[email protected]> Cc: Vlastimil Babka <[email protected]> Cc: Wei Yang <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2021-09-08mm/memory_hotplug: track present pages in memory groupsDavid Hildenbrand4-14/+34
Let's track all present pages in each memory group. Especially, track memory present in ZONE_MOVABLE and memory present in one of the kernel zones (which really only is ZONE_NORMAL right now as memory groups only apply to hotplugged memory) separately within a memory group, to prepare for making smart auto-online decision for individual memory blocks within a memory group based on group statistics. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: David Hildenbrand <[email protected]> Cc: Anshuman Khandual <[email protected]> Cc: Dan Williams <[email protected]> Cc: Dave Hansen <[email protected]> Cc: Greg Kroah-Hartman <[email protected]> Cc: Hui Zhu <[email protected]> Cc: Jason Wang <[email protected]> Cc: Len Brown <[email protected]> Cc: Marek Kedzierski <[email protected]> Cc: "Michael S. Tsirkin" <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Mike Rapoport <[email protected]> Cc: Oscar Salvador <[email protected]> Cc: Pankaj Gupta <[email protected]> Cc: Pavel Tatashin <[email protected]> Cc: Rafael J. Wysocki <[email protected]> Cc: "Rafael J. Wysocki" <[email protected]> Cc: Vitaly Kuznetsov <[email protected]> Cc: Vlastimil Babka <[email protected]> Cc: Wei Yang <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2021-09-08drivers/base/memory: introduce "memory groups" to logically group memory blocksDavid Hildenbrand4-6/+215
In our "auto-movable" memory onlining policy, we want to make decisions across memory blocks of a single memory device. Examples of memory devices include ACPI memory devices (in the simplest case a single DIMM) and virtio-mem. For now, we don't have a connection between a single memory block device and the real memory device. Each memory device consists of 1..X memory block devices. Let's logically group memory blocks belonging to the same memory device in "memory groups". Memory groups can span multiple physical ranges and a memory group itself does not contain any information regarding physical ranges, only properties (e.g., "max_pages") necessary for improved memory onlining. Introduce two memory group types: 1) Static memory group: E.g., a single ACPI memory device, consisting of 1..X memory resources. A memory group consists of 1..Y memory blocks. The whole group is added/removed in one go. If any part cannot get offlined, the whole group cannot be removed. 2) Dynamic memory group: E.g., a single virtio-mem device. Memory is dynamically added/removed in a fixed granularity, called a "unit", consisting of 1..X memory blocks. A unit is added/removed in one go. If any part of a unit cannot get offlined, the whole unit cannot be removed. In case of 1) we usually want either all memory managed by ZONE_MOVABLE or none. In case of 2) we usually want to have as many units as possible managed by ZONE_MOVABLE. We want a single unit to be of the same type. For now, memory groups are an internal concept that is not exposed to user space; we might want to change that in the future, though. add_memory() users can specify a mgid instead of a nid when passing the MHP_NID_IS_MGID flag. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: David Hildenbrand <[email protected]> Cc: Anshuman Khandual <[email protected]> Cc: Dan Williams <[email protected]> Cc: Dave Hansen <[email protected]> Cc: Greg Kroah-Hartman <[email protected]> Cc: Hui Zhu <[email protected]> Cc: Jason Wang <[email protected]> Cc: Len Brown <[email protected]> Cc: Marek Kedzierski <[email protected]> Cc: "Michael S. Tsirkin" <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Mike Rapoport <[email protected]> Cc: Oscar Salvador <[email protected]> Cc: Pankaj Gupta <[email protected]> Cc: Pavel Tatashin <[email protected]> Cc: Rafael J. Wysocki <[email protected]> Cc: "Rafael J. Wysocki" <[email protected]> Cc: Vitaly Kuznetsov <[email protected]> Cc: Vlastimil Babka <[email protected]> Cc: Wei Yang <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2021-09-08mm/memory_hotplug: introduce "auto-movable" online policyDavid Hildenbrand1-0/+191
When onlining without specifying a zone (using "online" instead of "online_kernel" or "online_movable"), we currently select a zone such that existing zones are kept contiguous. This online policy made sense in the past, where contiguous zones where required. We'd like to implement smarter policies, however: * User space has little insight. As one example, it has no idea which memory blocks logically belong together (e.g., to a DIMM or to a virtio-mem device). * Drivers that add memory in separate memory blocks, especially virtio-mem, want memory to get onlined right from the kernel when adding. So we really want to have onlining to differing zones managed in the kernel, configured by user space. We see more and more cases where we might eventually hotplug a lot of memory in the future (e.g., eventually grow a 2 GiB VM to 64 GiB), however: * Resizing happens dynamically, in smaller steps in both directions (e.g., 2 GiB -> 8 GiB -> 4 GiB -> 16 GiB ...) * We still want as much flexibility as possible, especially, hotunplugging as much memory as possible later. We can really only use "online_movable" if we know that the amount of memory we are going to hotplug upfront, and we know that it won't result in a zone imbalance. So in our example, a 2 GiB VM that could grow to 64 GiB could currently not use "online_movable", and instead, "online_kernel" would have to be used, resulting in worse (no) memory hotunplug reliability. Let's add a new "auto-movable" online policy that considers the current zone ratios (global, per-node) to determine, whether we a memory block can be onlined to ZONE_MOVABLE: MOVABLE : KERNEL However, internally we'll only consider the following ratio for now: MOVABLE : KERNEL_EARLY For now, we don't allow for hotplugged KERNEL memory to allow for more MOVABLE memory, because there is no coordination across memory devices. In follow-up patches, we will allow for more KERNEL memory within a memory device to allow for more MOVABLE memory within the same memory device -- which only makes sense for special memory device types. We base our calculation on "present pages", see the code comments for details. Hotplugged memory will get online to ZONE_MOVABLE if the configured ratio allows for it. Depending on the setup, this can result in fragmented zones, which can make compaction slower and dynamic allocation of gigantic pages when not using CMA less reliable (... which is already pretty unreliable). The old policy will be the default and called "contig-zones". In follow-up patches, our new policy will use additional information, such as memory groups, to make even smarter decisions across memory blocks. Configuration: * memory_hotplug.online_policy is used to switch between both polices and defaults to "contig-zones". * memory_hotplug.auto_movable_ratio defines the maximum ratio is in percent and defaults to "301" -- allowing e.g., most 8 GiB machines to grow to 32 GiB and have all hotplugged memory in ZONE_MOVABLE. The additional percent accounts for a handful of lost present pages (e.g., firmware allocations). User space is expected to adjust this ratio when enabling the new "auto-movable" policy, though. * memory_hotplug.auto_movable_numa_aware considers numa node stats in addition to global stats, and defaults to "true". Note: just like the old policy, the new policy won't take things like unmovable huge pages or memory ballooning that doesn't support balloon compaction into account. User space has to configure onlining accordingly. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: David Hildenbrand <[email protected]> Cc: Anshuman Khandual <[email protected]> Cc: Dan Williams <[email protected]> Cc: Dave Hansen <[email protected]> Cc: Greg Kroah-Hartman <[email protected]> Cc: Hui Zhu <[email protected]> Cc: Jason Wang <[email protected]> Cc: Len Brown <[email protected]> Cc: Marek Kedzierski <[email protected]> Cc: "Michael S. Tsirkin" <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Mike Rapoport <[email protected]> Cc: Oscar Salvador <[email protected]> Cc: Pankaj Gupta <[email protected]> Cc: Pavel Tatashin <[email protected]> Cc: Rafael J. Wysocki <[email protected]> Cc: "Rafael J. Wysocki" <[email protected]> Cc: Vitaly Kuznetsov <[email protected]> Cc: Vlastimil Babka <[email protected]> Cc: Wei Yang <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2021-09-08mm: track present early pages per zoneDavid Hildenbrand5-11/+29
Patch series "mm/memory_hotplug: "auto-movable" online policy and memory groups", v3. I. Goal The goal of this series is improving in-kernel auto-online support. It tackles the fundamental problems that: 1) We can create zone imbalances when onlining all memory blindly to ZONE_MOVABLE, in the worst case crashing the system. We have to know upfront how much memory we are going to hotplug such that we can safely enable auto-onlining of all hotplugged memory to ZONE_MOVABLE via "online_movable". This is far from practical and only applicable in limited setups -- like inside VMs under the RHV/oVirt hypervisor which will never hotplug more than 3 times the boot memory (and the limitation is only in place due to the Linux limitation). 2) We see more setups that implement dynamic VM resizing, hot(un)plugging memory to resize VM memory. In these setups, we might hotplug a lot of memory, but it might happen in various small steps in both directions (e.g., 2 GiB -> 8 GiB -> 4 GiB -> 16 GiB ...). virtio-mem is the primary driver of this upstream right now, performing such dynamic resizing NUMA-aware via multiple virtio-mem devices. Onlining all hotplugged memory to ZONE_NORMAL means we basically have no hotunplug guarantees. Onlining all to ZONE_MOVABLE means we can easily run into zone imbalances when growing a VM. We want a mixture, and we want as much memory as reasonable/configured in ZONE_MOVABLE. Details regarding zone imbalances can be found at [1]. 3) Memory devices consist of 1..X memory block devices, however, the kernel doesn't really track the relationship. Consequently, also user space has no idea. We want to make per-device decisions. As one example, for memory hotunplug it doesn't make sense to use a mixture of zones within a single DIMM: we want all MOVABLE if possible, otherwise all !MOVABLE, because any !MOVABLE part will easily block the whole DIMM from getting hotunplugged. As another example, virtio-mem operates on individual units that span 1..X memory blocks. Similar to a DIMM, we want a unit to either be all MOVABLE or !MOVABLE. A "unit" can be thought of like a DIMM, however, all units of a virtio-mem device logically belong together and are managed (added/removed) by a single driver. We want as much memory of a virtio-mem device to be MOVABLE as possible. 4) We want memory onlining to be done right from the kernel while adding memory, not triggered by user space via udev rules; for example, this is reqired for fast memory hotplug for drivers that add individual memory blocks, like virito-mem. We want a way to configure a policy in the kernel and avoid implementing advanced policies in user space. The auto-onlining support we have in the kernel is not sufficient. All we have is a) online everything MOVABLE (online_movable) b) online everything !MOVABLE (online_kernel) c) keep zones contiguous (online). This series allows configuring c) to mean instead "online movable if possible according to the coniguration, driven by a maximum MOVABLE:KERNEL ratio" -- a new onlining policy. II. Approach This series does 3 things: 1) Introduces the "auto-movable" online policy that initially operates on individual memory blocks only. It uses a maximum MOVABLE:KERNEL ratio to make a decision whether a memory block will be onlined to ZONE_MOVABLE or not. However, in the basic form, hotplugged KERNEL memory does not allow for more MOVABLE memory (details in the patches). CMA memory is treated like MOVABLE memory. 2) Introduces static (e.g., DIMM) and dynamic (e.g., virtio-mem) memory groups and uses group information to make decisions in the "auto-movable" online policy across memory blocks of a single memory device (modeled as memory group). More details can be found in patch #3 or in the DIMM example below. 3) Maximizes ZONE_MOVABLE memory within dynamic memory groups, by allowing ZONE_NORMAL memory within a dynamic memory group to allow for more ZONE_MOVABLE memory within the same memory group. The target use case is dynamic VM resizing using virtio-mem. See the virtio-mem example below. I remember that the basic idea of using a ratio to implement a policy in the kernel was once mentioned by Vitaly Kuznetsov, but I might be wrong (I lost the pointer to that discussion). For me, the main use case is using it along with virtio-mem (and DIMMs / ppc64 dlpar where necessary) for dynamic resizing of VMs, increasing the amount of memory we can hotunplug reliably again if we might eventually hotplug a lot of memory to a VM. III. Target Usage The target usage will be: 1) Linux boots with "mhp_default_online_type=offline" 2) User space (e.g., systemd unit) configures memory onlining (according to a config file and system properties), for example: * Setting memory_hotplug.online_policy=auto-movable * Setting memory_hotplug.auto_movable_ratio=301 * Setting memory_hotplug.auto_movable_numa_aware=true 3) User space enabled auto onlining via "echo online > /sys/devices/system/memory/auto_online_blocks" 4) User space triggers manual onlining of all already-offline memory blocks (go over offline memory blocks and set them to "online") IV. Example For DIMMs, hotplugging 4 GiB DIMMs to a 4 GiB VM with a configured ratio of 301% results in the following layout: Memory block 0-15: DMA32 (early) Memory block 32-47: Normal (early) Memory block 48-79: Movable (DIMM 0) Memory block 80-111: Movable (DIMM 1) Memory block 112-143: Movable (DIMM 2) Memory block 144-275: Normal (DIMM 3) Memory block 176-207: Normal (DIMM 4) ... all Normal (-> hotplugged Normal memory does not allow for more Movable memory) For virtio-mem, using a simple, single virtio-mem device with a 4 GiB VM will result in the following layout: Memory block 0-15: DMA32 (early) Memory block 32-47: Normal (early) Memory block 48-143: Movable (virtio-mem, first 12 GiB) Memory block 144: Normal (virtio-mem, next 128 MiB) Memory block 145-147: Movable (virtio-mem, next 384 MiB) Memory block 148: Normal (virtio-mem, next 128 MiB) Memory block 149-151: Movable (virtio-mem, next 384 MiB) ... Normal/Movable mixture as above (-> hotplugged Normal memory allows for more Movable memory within the same device) Which gives us maximum flexibility when dynamically growing/shrinking a VM in smaller steps. V. Doc Update I'll update the memory-hotplug.rst documentation, once the overhaul [1] is usptream. Until then, details can be found in patch #2. VI. Future Work 1) Use memory groups for ppc64 dlpar 2) Being able to specify a portion of (early) kernel memory that will be excluded from the ratio. Like "128 MiB globally/per node" are excluded. This might be helpful when starting VMs with extremely small memory footprint (e.g., 128 MiB) and hotplugging memory later -- not wanting the first hotplugged units getting onlined to ZONE_MOVABLE. One alternative would be a trigger to not consider ZONE_DMA memory in the ratio. We'll have to see if this is really rrequired. 3) Indicate to user space that MOVABLE might be a bad idea -- especially relevant when memory ballooning without support for balloon compaction is active. This patch (of 9): For implementing a new memory onlining policy, which determines when to online memory blocks to ZONE_MOVABLE semi-automatically, we need the number of present early (boot) pages -- present pages excluding hotplugged pages. Let's track these pages per zone. Pass a page instead of the zone to adjust_present_page_count(), similar as adjust_managed_page_count() and derive the zone from the page. It's worth noting that a memory block to be offlined/onlined is either completely "early" or "not early". add_memory() and friends can only add complete memory blocks and we only online/offline complete (individual) memory blocks. Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: David Hildenbrand <[email protected]> Cc: Vitaly Kuznetsov <[email protected]> Cc: "Michael S. Tsirkin" <[email protected]> Cc: Jason Wang <[email protected]> Cc: Marek Kedzierski <[email protected]> Cc: Hui Zhu <[email protected]> Cc: Pankaj Gupta <[email protected]> Cc: Wei Yang <[email protected]> Cc: Oscar Salvador <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Dan Williams <[email protected]> Cc: Anshuman Khandual <[email protected]> Cc: Dave Hansen <[email protected]> Cc: Vlastimil Babka <[email protected]> Cc: Mike Rapoport <[email protected]> Cc: "Rafael J. Wysocki" <[email protected]> Cc: Len Brown <[email protected]> Cc: Pavel Tatashin <[email protected]> Cc: Greg Kroah-Hartman <[email protected]> Cc: Rafael J. Wysocki <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2021-09-08ACPI: memhotplug: memory resources cannot be enabled yetDavid Hildenbrand1-4/+0
We allocate + initialize everything from scratch. In case enabling the device fails, we free all memory resourcs. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: David Hildenbrand <[email protected]> Acked-by: Rafael J. Wysocki <[email protected]> Reviewed-by: Oscar Salvador <[email protected]> Reviewed-by: Pankaj Gupta <[email protected]> Cc: Andy Lutomirski <[email protected]> Cc: "Aneesh Kumar K.V" <[email protected]> Cc: Anshuman Khandual <[email protected]> Cc: Anton Blanchard <[email protected]> Cc: Ard Biesheuvel <[email protected]> Cc: Baoquan He <[email protected]> Cc: Benjamin Herrenschmidt <[email protected]> Cc: Borislav Petkov <[email protected]> Cc: Catalin Marinas <[email protected]> Cc: Christian Borntraeger <[email protected]> Cc: Christophe Leroy <[email protected]> Cc: Dan Williams <[email protected]> Cc: Dave Hansen <[email protected]> Cc: Dave Jiang <[email protected]> Cc: Heiko Carstens <[email protected]> Cc: "H. Peter Anvin" <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Jason Wang <[email protected]> Cc: Jia He <[email protected]> Cc: Joe Perches <[email protected]> Cc: Kefeng Wang <[email protected]> Cc: Laurent Dufour <[email protected]> Cc: Len Brown <[email protected]> Cc: Michael Ellerman <[email protected]> Cc: "Michael S. Tsirkin" <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Michel Lespinasse <[email protected]> Cc: Mike Rapoport <[email protected]> Cc: Nathan Lynch <[email protected]> Cc: Nicholas Piggin <[email protected]> Cc: Pankaj Gupta <[email protected]> Cc: Paul Mackerras <[email protected]> Cc: Pavel Tatashin <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Pierre Morel <[email protected]> Cc: "Rafael J. Wysocki" <[email protected]> Cc: Rich Felker <[email protected]> Cc: Scott Cheloha <[email protected]> Cc: Sergei Trofimovich <[email protected]> Cc: Thiago Jung Bauermann <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Vasily Gorbik <[email protected]> Cc: Vishal Verma <[email protected]> Cc: Vitaly Kuznetsov <[email protected]> Cc: Vlastimil Babka <[email protected]> Cc: Wei Yang <[email protected]> Cc: Will Deacon <[email protected]> Cc: Yoshinori Sato <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2021-09-08mm/memory_hotplug: remove nid parameter from remove_memory() and friendsDavid Hildenbrand6-31/+30
There is only a single user remaining. We can simply lookup the nid only used for node offlining purposes when walking our memory blocks. We don't expect to remove multi-nid ranges; and if we'd ever do, we most probably don't care about removing multi-nid ranges that actually result in empty nodes. If ever required, we can detect the "multi-nid" scenario and simply try offlining all online nodes. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: David Hildenbrand <[email protected]> Acked-by: Michael Ellerman <[email protected]> (powerpc) Cc: Michael Ellerman <[email protected]> Cc: Benjamin Herrenschmidt <[email protected]> Cc: Paul Mackerras <[email protected]> Cc: "Rafael J. Wysocki" <[email protected]> Cc: Len Brown <[email protected]> Cc: Dan Williams <[email protected]> Cc: Vishal Verma <[email protected]> Cc: Dave Jiang <[email protected]> Cc: "Michael S. Tsirkin" <[email protected]> Cc: Jason Wang <[email protected]> Cc: Nathan Lynch <[email protected]> Cc: Laurent Dufour <[email protected]> Cc: "Aneesh Kumar K.V" <[email protected]> Cc: Scott Cheloha <[email protected]> Cc: Anton Blanchard <[email protected]> Cc: Andy Lutomirski <[email protected]> Cc: Anshuman Khandual <[email protected]> Cc: Ard Biesheuvel <[email protected]> Cc: Baoquan He <[email protected]> Cc: Borislav Petkov <[email protected]> Cc: Catalin Marinas <[email protected]> Cc: Christian Borntraeger <[email protected]> Cc: Christophe Leroy <[email protected]> Cc: Dave Hansen <[email protected]> Cc: Heiko Carstens <[email protected]> Cc: "H. Peter Anvin" <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Jia He <[email protected]> Cc: Joe Perches <[email protected]> Cc: Kefeng Wang <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Michel Lespinasse <[email protected]> Cc: Mike Rapoport <[email protected]> Cc: Nicholas Piggin <[email protected]> Cc: Oscar Salvador <[email protected]> Cc: Pankaj Gupta <[email protected]> Cc: Pankaj Gupta <[email protected]> Cc: Pavel Tatashin <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Pierre Morel <[email protected]> Cc: "Rafael J. Wysocki" <[email protected]> Cc: Rich Felker <[email protected]> Cc: Sergei Trofimovich <[email protected]> Cc: Thiago Jung Bauermann <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Vasily Gorbik <[email protected]> Cc: Vitaly Kuznetsov <[email protected]> Cc: Vlastimil Babka <[email protected]> Cc: Wei Yang <[email protected]> Cc: Will Deacon <[email protected]> Cc: Yoshinori Sato <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2021-09-08mm/memory_hotplug: remove nid parameter from arch_remove_memory()David Hildenbrand10-22/+11
The parameter is unused, let's remove it. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: David Hildenbrand <[email protected]> Acked-by: Catalin Marinas <[email protected]> Acked-by: Michael Ellerman <[email protected]> [powerpc] Acked-by: Heiko Carstens <[email protected]> [s390] Reviewed-by: Pankaj Gupta <[email protected]> Reviewed-by: Oscar Salvador <[email protected]> Cc: Catalin Marinas <[email protected]> Cc: Will Deacon <[email protected]> Cc: Michael Ellerman <[email protected]> Cc: Benjamin Herrenschmidt <[email protected]> Cc: Paul Mackerras <[email protected]> Cc: Heiko Carstens <[email protected]> Cc: Vasily Gorbik <[email protected]> Cc: Christian Borntraeger <[email protected]> Cc: Yoshinori Sato <[email protected]> Cc: Rich Felker <[email protected]> Cc: Dave Hansen <[email protected]> Cc: Andy Lutomirski <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Borislav Petkov <[email protected]> Cc: "H. Peter Anvin" <[email protected]> Cc: Anshuman Khandual <[email protected]> Cc: Ard Biesheuvel <[email protected]> Cc: Mike Rapoport <[email protected]> Cc: Nicholas Piggin <[email protected]> Cc: Pavel Tatashin <[email protected]> Cc: Baoquan He <[email protected]> Cc: Laurent Dufour <[email protected]> Cc: Sergei Trofimovich <[email protected]> Cc: Kefeng Wang <[email protected]> Cc: Michel Lespinasse <[email protected]> Cc: Christophe Leroy <[email protected]> Cc: "Aneesh Kumar K.V" <[email protected]> Cc: Thiago Jung Bauermann <[email protected]> Cc: Joe Perches <[email protected]> Cc: Pierre Morel <[email protected]> Cc: Jia He <[email protected]> Cc: Anton Blanchard <[email protected]> Cc: Dan Williams <[email protected]> Cc: Dave Jiang <[email protected]> Cc: Jason Wang <[email protected]> Cc: Len Brown <[email protected]> Cc: "Michael S. Tsirkin" <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Nathan Lynch <[email protected]> Cc: Pankaj Gupta <[email protected]> Cc: "Rafael J. Wysocki" <[email protected]> Cc: "Rafael J. Wysocki" <[email protected]> Cc: Scott Cheloha <[email protected]> Cc: Vishal Verma <[email protected]> Cc: Vitaly Kuznetsov <[email protected]> Cc: Vlastimil Babka <[email protected]> Cc: Wei Yang <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2021-09-08mm/memory_hotplug: use "unsigned long" for PFN in zone_for_pfn_range()David Hildenbrand2-4/+4
Patch series "mm/memory_hotplug: preparatory patches for new online policy and memory" These are all cleanups and one fix previously sent as part of [1]: [PATCH v1 00/12] mm/memory_hotplug: "auto-movable" online policy and memory groups. These patches make sense even without the other series, therefore I pulled them out to make the other series easier to digest. [1] https://lkml.kernel.org/r/[email protected] This patch (of 4): Checkpatch complained on a follow-up patch that we are using "unsigned" here, which defaults to "unsigned int" and checkpatch is correct. As we will search for a fitting zone using the wrong pfn, we might end up onlining memory to one of the special kernel zones, such as ZONE_DMA, which can end badly as the onlined memory does not satisfy properties of these zones. Use "unsigned long" instead, just as we do in other places when handling PFNs. This can bite us once we have physical addresses in the range of multiple TB. Link: https://lkml.kernel.org/r/[email protected] Fixes: e5e689302633 ("mm, memory_hotplug: display allowed zones in the preferred ordering") Signed-off-by: David Hildenbrand <[email protected]> Reviewed-by: Pankaj Gupta <[email protected]> Reviewed-by: Muchun Song <[email protected]> Reviewed-by: Oscar Salvador <[email protected]> Cc: David Hildenbrand <[email protected]> Cc: Vitaly Kuznetsov <[email protected]> Cc: "Michael S. Tsirkin" <[email protected]> Cc: Jason Wang <[email protected]> Cc: Pankaj Gupta <[email protected]> Cc: Wei Yang <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Dan Williams <[email protected]> Cc: Anshuman Khandual <[email protected]> Cc: Dave Hansen <[email protected]> Cc: Vlastimil Babka <[email protected]> Cc: Mike Rapoport <[email protected]> Cc: "Rafael J. Wysocki" <[email protected]> Cc: Len Brown <[email protected]> Cc: Pavel Tatashin <[email protected]> Cc: Heiko Carstens <[email protected]> Cc: Michael Ellerman <[email protected]> Cc: Catalin Marinas <[email protected]> Cc: [email protected] Cc: Andy Lutomirski <[email protected]> Cc: "Aneesh Kumar K.V" <[email protected]> Cc: Anton Blanchard <[email protected]> Cc: Ard Biesheuvel <[email protected]> Cc: Baoquan He <[email protected]> Cc: Benjamin Herrenschmidt <[email protected]> Cc: Borislav Petkov <[email protected]> Cc: Christian Borntraeger <[email protected]> Cc: Christophe Leroy <[email protected]> Cc: Dave Jiang <[email protected]> Cc: "H. Peter Anvin" <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Jia He <[email protected]> Cc: Joe Perches <[email protected]> Cc: Kefeng Wang <[email protected]> Cc: Laurent Dufour <[email protected]> Cc: Michel Lespinasse <[email protected]> Cc: Nathan Lynch <[email protected]> Cc: Nicholas Piggin <[email protected]> Cc: Paul Mackerras <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Pierre Morel <[email protected]> Cc: "Rafael J. Wysocki" <[email protected]> Cc: Rich Felker <[email protected]> Cc: Scott Cheloha <[email protected]> Cc: Sergei Trofimovich <[email protected]> Cc: Thiago Jung Bauermann <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Vasily Gorbik <[email protected]> Cc: Vishal Verma <[email protected]> Cc: Will Deacon <[email protected]> Cc: Yoshinori Sato <[email protected]> Cc: <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2021-09-08mm: memory_hotplug: cleanup after removal of pfn_valid_within()Mike Rapoport1-6/+3
When test_pages_in_a_zone() used pfn_valid_within() is has some logic surrounding pfn_valid_within() checks. Since pfn_valid_within() is gone, this logic can be removed. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Mike Rapoport <[email protected]> Acked-by: David Hildenbrand <[email protected]> Cc: Greg Kroah-Hartman <[email protected]> Cc: "Rafael J. Wysocki" <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2021-09-08mm: remove pfn_valid_within() and CONFIG_HOLES_IN_ZONEMike Rapoport8-75/+11
Patch series "mm: remove pfn_valid_within() and CONFIG_HOLES_IN_ZONE". After recent updates to freeing unused parts of the memory map, no architecture can have holes in the memory map within a pageblock. This makes pfn_valid_within() check and CONFIG_HOLES_IN_ZONE configuration option redundant. The first patch removes them both in a mechanical way and the second patch simplifies memory_hotplug::test_pages_in_a_zone() that had pfn_valid_within() surrounded by more logic than simple if. This patch (of 2): After recent changes in freeing of the unused parts of the memory map and rework of pfn_valid() in arm and arm64 there are no architectures that can have holes in the memory map within a pageblock and so nothing can enable CONFIG_HOLES_IN_ZONE which guards non trivial implementation of pfn_valid_within(). With that, pfn_valid_within() is always hardwired to 1 and can be completely removed. Remove calls to pfn_valid_within() and CONFIG_HOLES_IN_ZONE. Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Mike Rapoport <[email protected]> Acked-by: David Hildenbrand <[email protected]> Cc: Greg Kroah-Hartman <[email protected]> Cc: "Rafael J. Wysocki" <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2021-09-08memory-hotplug.rst: complete admin-guide overhaulDavid Hildenbrand1-306/+455
The memory hot(un)plug documentation is outdated and incomplete. Most of the content dates back to 2007, so it's time for a major overhaul. Let's rewrite, reorganize and update most parts of the documentation. In addition to memory hot(un)plug, also add some details regarding ZONE_MOVABLE, with memory hotunplug being one of its main consumers. Drop the file history, that information can more reliably be had from the git log. The style of the document is also properly fixed that e.g., "restview" renders it cleanly now. In the future, we might add some more details about virt users like virtio-mem, the XEN balloon, the Hyper-V balloon and ppc64 dlpar. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: David Hildenbrand <[email protected]> Acked-by: Michal Hocko <[email protected]> Reviewed-by: Mike Rapoport <[email protected]> Reviewed-by: Oscar Salvador <[email protected]> Cc: Mike Kravetz <[email protected]> Cc: Dave Hansen <[email protected]> Cc: Matthew Wilcox <[email protected]> Cc: Anshuman Khandual <[email protected]> Cc: Muchun Song <[email protected]> Cc: Pavel Tatashin <[email protected]> Cc: Jonathan Corbet <[email protected]> Cc: Stephen Rothwell <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2021-09-08memory-hotplug.rst: remove locking details from admin-guideDavid Hildenbrand1-39/+0
Patch series "memory-hotplug.rst: complete admin-guide overhaul", v3. This patch (of 2): We have the same content at Documentation/core-api/memory-hotplug.rst and it doesn't fit into the admin-guide. The documentation was accidentially duplicated when merging. Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: David Hildenbrand <[email protected]> Acked-by: Mike Rapoport <[email protected]> Acked-by: Michal Hocko <[email protected]> Reviewed-by: Oscar Salvador <[email protected]> Cc: Mike Kravetz <[email protected]> Cc: Dave Hansen <[email protected]> Cc: Matthew Wilcox <[email protected]> Cc: Anshuman Khandual <[email protected]> Cc: Muchun Song <[email protected]> Cc: Pavel Tatashin <[email protected]> Cc: Jonathan Corbet <[email protected]> Cc: Stephen Rothwell <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2021-09-08drm/i915: clean up inconsistent indentingColin Ian King1-1/+1
There is a statement that is indented one character too deeply, clean this up. Signed-off-by: Colin Ian King <[email protected]> Signed-off-by: Daniel Vetter <[email protected]> Link: https://patchwork.freedesktop.org/patch/msgid/[email protected]
2021-09-08hugetlbfs: s390 is always 64bitDavid Hildenbrand2-2/+2
No need to check for 64BIT. While at it, let's just select ARCH_SUPPORTS_HUGETLBFS from arch/s390/Kconfig. Signed-off-by: David Hildenbrand <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Heiko Carstens <[email protected]>
2021-09-08fbmem: don't allow too huge resolutionsTetsuo Handa1-0/+6
syzbot is reporting page fault at vga16fb_fillrect() [1], for vga16fb_check_var() is failing to detect multiplication overflow. if (vxres * vyres > maxmem) { vyres = maxmem / vxres; if (vyres < yres) return -ENOMEM; } Since no module would accept too huge resolutions where multiplication overflow happens, let's reject in the common path. Link: https://syzkaller.appspot.com/bug?extid=04168c8063cfdde1db5e [1] Reported-by: syzbot <[email protected]> Debugged-by: Randy Dunlap <[email protected]> Signed-off-by: Tetsuo Handa <[email protected]> Reviewed-by: Geert Uytterhoeven <[email protected]> Cc: [email protected] Signed-off-by: Daniel Vetter <[email protected]> Link: https://patchwork.freedesktop.org/patch/msgid/[email protected]
2021-09-08Makefile: use -Wno-main in the full kernel treeRandy Dunlap1-0/+2
When using gcc (SUSE Linux) 7.5.0 (on openSUSE 15.3), I see a build warning: kernel/trace/trace_osnoise.c: In function 'start_kthread': kernel/trace/trace_osnoise.c:1461:8: warning: 'main' is usually a function [-Wmain] void *main = osnoise_main; ^~~~ Quieten that warning by using "-Wno-main". It's OK to use "main" as a declaration name in the kernel. Build-tested on most ARCHes. [ v2: only do it for gcc, since clang doesn't have that particular warning ] Signed-off-by: Randy Dunlap <[email protected]> Link: https://lore.kernel.org/lkml/[email protected]/ Suggested-by: Steven Rostedt <[email protected]> Suggested-by: Linus Torvalds <[email protected]> Cc: Daniel Bristot de Oliveira <[email protected]> Cc: Masahiro Yamada <[email protected]> Cc: Michal Marek <[email protected]> Cc: [email protected] Signed-off-by: Linus Torvalds <[email protected]>
2021-09-08Merge tag 'asoc-fix-v5.15-rc1' of ↵Takashi Iwai9-39/+47
https://git.kernel.org/pub/scm/linux/kernel/git/broonie/sound into for-linus ASoC: Fixes for v5.15 A collection of fixes that came in during the merge window, nothing too remarkable but a reasonably large number of fixes.
2021-09-08time: Handle negative seconds correctly in timespec64_to_ns()Lukas Hannen1-2/+7
timespec64_ns() prevents multiplication overflows by comparing the seconds value of the timespec to KTIME_SEC_MAX. If the value is greater or equal it returns KTIME_MAX. But that check casts the signed seconds value to unsigned which makes the comparision true for all negative values and therefore return wrongly KTIME_MAX. Negative second values are perfectly valid and required in some places, e.g. ptp_clock_adjtime(). Remove the cast and add a check for the negative boundary which is required to prevent undefined behaviour due to multiplication underflow. Fixes: cb47755725da ("time: Prevent undefined behaviour in timespec64_to_ns()")' Signed-off-by: Lukas Hannen <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Cc: [email protected] Link: https://lore.kernel.org/r/AM6PR01MB541637BD6F336B8FFB72AF80EEC69@AM6PR01MB5416.eurprd01.prod.exchangelabs.com
2021-09-08Merge branches 'acpi-pm' and 'acpi-docs'Rafael J. Wysocki2-52/+64
* acpi-pm: ACPI: PM: s2idle: Run both AMD and Microsoft methods if both are supported * acpi-docs: Documentation: ACPI: Align the SSDT overlays file with the code
2021-09-08Merge branch 'pm-opp'Rafael J. Wysocki18-639/+757
* pm-opp: dt-bindings: opp: Convert to DT schema dt-bindings: Clean-up OPP binding node names in examples ARM: dts: omap: Drop references to opp.txt
2021-09-08Merge branch 'pm-cpufreq'Rafael J. Wysocki21-128/+684
* pm-cpufreq: Revert "cpufreq: intel_pstate: Process HWP Guaranteed change notification" cpufreq: mediatek-hw: Add support for CPUFREQ HW cpufreq: Add of_perf_domain_get_sharing_cpumask dt-bindings: cpufreq: add bindings for MediaTek cpufreq HW cpufreq: Remove ready() callback cpufreq: sh: Remove sh_cpufreq_cpu_ready() cpufreq: acpi: Remove acpi_cpufreq_cpu_ready() cpufreq: qcom-hw: Set dvfs_possible_from_any_cpu cpufreq driver flag cpufreq: blocklist more Qualcomm platforms in cpufreq-dt-platdev cpufreq: qcom-cpufreq-hw: Add dcvs interrupt support cpufreq: scmi: Use .register_em() to register with energy model cpufreq: vexpress: Use .register_em() to register with energy model cpufreq: scpi: Use .register_em() to register with energy model cpufreq: qcom-cpufreq-hw: Use .register_em() to register with energy model cpufreq: omap: Use .register_em() to register with energy model cpufreq: mediatek: Use .register_em() to register with energy model cpufreq: imx6q: Use .register_em() to register with energy model cpufreq: dt: Use .register_em() to register with energy model cpufreq: Add callback to register with energy model cpufreq: vexpress: Set CPUFREQ_IS_COOLING_DEV flag
2021-09-08io-wq: fix cancellation on create-worker failurePavel Begunkov1-9/+20
WARNING: CPU: 0 PID: 10392 at fs/io_uring.c:1151 req_ref_put_and_test fs/io_uring.c:1151 [inline] WARNING: CPU: 0 PID: 10392 at fs/io_uring.c:1151 req_ref_put_and_test fs/io_uring.c:1146 [inline] WARNING: CPU: 0 PID: 10392 at fs/io_uring.c:1151 io_req_complete_post+0xf5b/0x1190 fs/io_uring.c:1794 Modules linked in: Call Trace: tctx_task_work+0x1e5/0x570 fs/io_uring.c:2158 task_work_run+0xe0/0x1a0 kernel/task_work.c:164 tracehook_notify_signal include/linux/tracehook.h:212 [inline] handle_signal_work kernel/entry/common.c:146 [inline] exit_to_user_mode_loop kernel/entry/common.c:172 [inline] exit_to_user_mode_prepare+0x232/0x2a0 kernel/entry/common.c:209 __syscall_exit_to_user_mode_work kernel/entry/common.c:291 [inline] syscall_exit_to_user_mode+0x19/0x60 kernel/entry/common.c:302 do_syscall_64+0x42/0xb0 arch/x86/entry/common.c:86 entry_SYSCALL_64_after_hwframe+0x44/0xae When io_wqe_enqueue() -> io_wqe_create_worker() fails, we can't just call io_run_cancel() to clean up the request, it's already enqueued via io_wqe_insert_work() and will be executed either by some other worker during cancellation (e.g. in io_wq_put_and_exit()). Reported-by: Hao Sun <[email protected]> Fixes: 3146cba99aa28 ("io-wq: make worker creation resilient against signals") Signed-off-by: Pavel Begunkov <[email protected]> Link: https://lore.kernel.org/r/93b9de0fcf657affab0acfd675d4abcd273ee863.1631092071.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <[email protected]>
2021-09-08s390/ftrace: remove incorrect __va usageHeiko Carstens1-2/+2
The address of ftrace_graph_caller is already virtual. Using __va() to translate the address is wrong. Reviewed-by: Alexander Gordeev <[email protected]> Signed-off-by: Heiko Carstens <[email protected]>
2021-09-08s390/zcrypt: remove incorrect kernel doc indicatorsHeiko Carstens6-48/+48
Many comments above functions start with a kernel doc indicator, but the comments are not using kernel doc style. Get rid of the warnings by simply removing the indicator. E.g.: drivers/s390/crypto/zcrypt_msgtype6.c:111: warning: This comment starts with '/**', but isn't a kernel-doc comment. Reviewed-by: Harald Freudenberger <[email protected]> Signed-off-by: Heiko Carstens <[email protected]>
2021-09-08scsi: zfcp: fix kernel doc commentsHeiko Carstens4-6/+6
A couple of function names don't match what the kernel doc comments indicate. Acked-by: Benjamin Block <[email protected]> Signed-off-by: Heiko Carstens <[email protected]>
2021-09-08s390/sclp: add __nonstring annotationHeiko Carstens1-1/+1
Add __nonstring annotation, since the missing string termination for id member of sclp_trace_entry is intended. This way we get rid of this warning: drivers/s390/char/sclp.c:84:9: warning: ‘strncpy’ output truncated before terminating nul copying 4 bytes from a string of the same length [-Wstringop-truncation] 84 | strncpy(e.id, id, sizeof(e.id)); | ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Reported-by: kernel test robot <[email protected]> Reviewed-by: Peter Oberparleiter <[email protected]> Signed-off-by: Heiko Carstens <[email protected]>
2021-09-08IB/hfi1: make hist staticchongjiapeng1-1/+1
This symbol is not used outside of trace.c, so marks it static. Fix the following sparse warning: drivers/infiniband/hw/hfi1/trace.c:491:23: warning: symbol 'hist' was not declared. Should it be static? Link: https://lore.kernel.org/r/1630921723-21545-1-git-send-email-jiapeng.chong@linux.alibaba.com Reported-by: Abaci Robot <[email protected]> Signed-off-by: chongjiapeng <[email protected]> Signed-off-by: Jason Gunthorpe <[email protected]>
2021-09-08RDMA/bnxt_re: Prefer kcalloc over open coded arithmeticLen Baker1-2/+2
As noted in the "Deprecated Interfaces, Language Features, Attributes, and Conventions" documentation [1], size calculations (especially multiplication) should not be performed in memory allocator (or similar) function arguments due to the risk of them overflowing. This could lead to values wrapping around and a smaller allocation being made than the caller was expecting. Using those allocations could lead to linear overflows of heap memory and other misbehaviors. In this case this is not actually dynamic sizes: both sides of the multiplication are constant values. However it is best to refactor this anyway, just to keep the open-coded math idiom out of code. So, use the purpose specific kcalloc() function instead of the argument size * count in the kzalloc() function. Also, remove the unnecessary initialization of the sqp_tbl variable since it is set a few lines later. [1] https://www.kernel.org/doc/html/v5.14/process/deprecated.html#open-coded-arithmetic-in-allocator-arguments Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Len Baker <[email protected]> Reviewed-by: Leon Romanovsky <[email protected]> Acked-by: Selvin Xavier <[email protected]> Signed-off-by: Jason Gunthorpe <[email protected]>
2021-09-08IB/qib: Fix null pointer subtraction compiler warningJason Gunthorpe1-1/+3
>> drivers/infiniband/hw/qib/qib_sysfs.c:411:1: warning: performing pointer subtraction with a null pointer has undefined behavior +[-Wnull-pointer-subtraction] QIB_DIAGC_ATTR(rc_resends); ^~~~~~~~~~~~~~~~~~~~~~~~~~ drivers/infiniband/hw/qib/qib_sysfs.c:408:51: note: expanded from macro 'QIB_DIAGC_ATTR' .counter = &((struct qib_ibport *)0)->rvp.n_##N - (u64 *)0, \ Use offsetof and accomplish the type check using static_assert. Fixes: 4a7aaf88c89f ("RDMA/qib: Use attributes for the port sysfs") Link: https://lore.kernel.org/r/[email protected] Reported-by: kernel test robot <[email protected]> Acked-by: Dennis Dalessandro <[email protected]> Signed-off-by: Jason Gunthorpe <[email protected]>
2021-09-08RDMA/mlx5: Fix xlt_chunk_align calculationNiklas Schnelle1-1/+1
The XLT chunk alignment depends on ent_size not sizeof(ent_size) aka sizeof(size_t). The incoming ent_size is either 8 or 16, so the miscalculation when 16 is required is only an over-alignment and functional harmless. Fixes: 8010d74b9965 ("RDMA/mlx5: Split the WR setup out of mlx5_ib_update_xlt()") Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Niklas Schnelle <[email protected]> Signed-off-by: Jason Gunthorpe <[email protected]>
2021-09-08RDMA/mlx5: Fix number of allocated XLT entriesNiklas Schnelle1-1/+1
In commit 8010d74b9965b ("RDMA/mlx5: Split the WR setup out of mlx5_ib_update_xlt()") the allocation logic was split out of mlx5_ib_update_xlt() and the logic was changed to enable better OOM handling. Sadly this change introduced a miscalculation of the number of entries that were actually allocated when under memory pressure where it can actually become 0 which on s390 lets dma_map_single() fail. It can also lead to corruption of the free pages list when the wrong number of entries is used in the calculation of sg->length which is used as argument for free_pages(). Fix this by using the allocation size instead of misusing get_order(size). Cc: [email protected] Fixes: 8010d74b9965 ("RDMA/mlx5: Split the WR setup out of mlx5_ib_update_xlt()") Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Niklas Schnelle <[email protected]> Signed-off-by: Jason Gunthorpe <[email protected]>
2021-09-08drm/i915/selftests: fixup igt_shrink_thpMatthew Auld1-7/+23
Since the object might still be active here, the shrink_all will simply ignore it, which blows up in the test, since the pages will still be there. Currently THP is disabled which should result in the test being skipped, but if we ever re-enable THP we might start seeing the failure. Fix this by forcing I915_SHRINK_ACTIVE. v2: Some machine in the shard runs doesn't seem to have any available swap when running this test. Try to handle this. Signed-off-by: Matthew Auld <[email protected]> Cc: Tvrtko Ursulin <[email protected]> Cc: Thomas Hellström <[email protected]> Reviewed-by: Tvrtko Ursulin <[email protected]> Link: https://patchwork.freedesktop.org/patch/msgid/[email protected]
2021-09-08drm/i915/gtt: add some flushing for the 64K GTT pathMatthew Auld1-0/+2
If we need to mark the PDE as operating in 64K GTT mode, we should be paranoid and flush the extra writes, like we already do for the PTEs. On some platforms the clflush can apparently add the just the right amount of magical delay to force the GPU to see the updated entry. Signed-off-by: Matthew Auld <[email protected]> Cc: Mika Kuoppala <[email protected]> Reviewed-by: Mika Kuoppala <[email protected]> Link: https://patchwork.freedesktop.org/patch/msgid/[email protected]
2021-09-07drm/i915/gt: Add separate MOCS table for Gen12 devices other than TGL/RKLAyaz A Siddiqui1-4/+36
MOCS table of TGL/RKL has MOCS[1] set to L3_UC. While for other gen12 devices we need to set MOCS[1] as L3_WB, So adding a new MOCS table for other gen 12 devices eg. ADL. Fixes: cfbe5291a189 ("drm/i915/gt: Initialize unused MOCS entries with device specific values") Cc: Matt Roper <[email protected]> Signed-off-by: Ayaz A Siddiqui <[email protected]> Reviewed-by: Matt Roper <[email protected]> [mattrope: fix whitespace error] Signed-off-by: Matt Roper <[email protected]> Link: https://patchwork.freedesktop.org/patch/msgid/[email protected]
2021-09-07Merge tag 'pci-v5.15-changes' of ↵Linus Torvalds100-1555/+3860
git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci Pull PCI updates from Bjorn Helgaas: "Enumeration: - Convert controller drivers to generic_handle_domain_irq() (Marc Zyngier) - Simplify VPD (Vital Product Data) access and search (Heiner Kallweit) - Update bnx2, bnx2x, bnxt, cxgb4, cxlflash, sfc, tg3 drivers to use simplified VPD interfaces (Heiner Kallweit) - Run Max Payload Size quirks before configuring MPS; work around ASMedia ASM1062 SATA MPS issue (Marek Behún) Resource management: - Refactor pci_ioremap_bar() and pci_ioremap_wc_bar() (Krzysztof Wilczyński) - Optimize pci_resource_len() to reduce kernel size (Zhen Lei) PCI device hotplug: - Fix a double unmap in ibmphp (Vishal Aslot) PCIe port driver: - Enable Bandwidth Notification only if port supports it (Stuart Hayes) Sysfs/proc/syscalls: - Add schedule point in proc_bus_pci_read() (Krzysztof Wilczyński) - Return ~0 data on pciconfig_read() CAP_SYS_ADMIN failure (Krzysztof Wilczyński) - Return "int" from pciconfig_read() syscall (Krzysztof Wilczyński) Virtualization: - Extend "pci=noats" to also turn on Translation Blocking to protect against some DMA attacks (Alex Williamson) - Add sysfs mechanism to control the type of reset used between device assignments to VMs (Amey Narkhede) - Add support for ACPI _RST reset method (Shanker Donthineni) - Add ACS quirks for Cavium multi-function devices (George Cherian) - Add ACS quirks for NXP LX2xx0 and LX2xx2 platforms (Wasim Khan) - Allow HiSilicon AMBA devices that appear as fake PCI devices to use PASID and SVA (Zhangfei Gao) Endpoint framework: - Add support for SR-IOV Endpoint devices (Kishon Vijay Abraham I) - Zero-initialize endpoint test tool parameters so we don't use random parameters (Shunyong Yang) APM X-Gene PCIe controller driver: - Remove redundant dev_err() call in xgene_msi_probe() (ErKun Yang) Broadcom iProc PCIe controller driver: - Don't fail devm_pci_alloc_host_bridge() on missing 'ranges' because it's optional on BCMA devices (Rob Herring) - Fix BCMA probe resource handling (Rob Herring) Cadence PCIe driver: - Work around J7200 Link training electrical issue by increasing delays in LTSSM (Nadeem Athani) Intel IXP4xx PCI controller driver: - Depend on ARCH_IXP4XX to avoid useless config questions (Geert Uytterhoeven) Intel Keembay PCIe controller driver: - Add Intel Keem Bay PCIe controller (Srikanth Thokala) Marvell Aardvark PCIe controller driver: - Work around config space completion handling issues (Evan Wang) - Increase timeout for config access completions (Pali Rohár) - Emulate CRS Software Visibility bit (Pali Rohár) - Configure resources from DT 'ranges' property to fix I/O space access (Pali Rohár) - Serialize INTx mask/unmask (Pali Rohár) MediaTek PCIe controller driver: - Add MT7629 support in DT (Chuanjia Liu) - Fix an MSI issue (Chuanjia Liu) - Get syscon regmap ("mediatek,generic-pciecfg"), IRQ number ("pci_irq"), PCI domain ("linux,pci-domain") from DT properties if present (Chuanjia Liu) Microsoft Hyper-V host bridge driver: - Add ARM64 support (Boqun Feng) - Support "Create Interrupt v3" message (Sunil Muthuswamy) NVIDIA Tegra PCIe controller driver: - Use seq_puts(), move err_msg from stack to static, fix OF node leak (Christophe JAILLET) NVIDIA Tegra194 PCIe driver: - Disable suspend when in Endpoint mode (Om Prakash Singh) - Fix MSI-X address programming error (Om Prakash Singh) - Disable interrupts during suspend to avoid spurious AER link down (Om Prakash Singh) Renesas R-Car PCIe controller driver: - Work around hardware issue that prevents Link L1->L0 transition (Marek Vasut) - Fix runtime PM refcount leak (Dinghao Liu) Rockchip DesignWare PCIe controller driver: - Add Rockchip RK356X host controller driver (Simon Xue) TI J721E PCIe driver: - Add support for J7200 and AM64 (Kishon Vijay Abraham I) Toshiba Visconti PCIe controller driver: - Add Toshiba Visconti PCIe host controller driver (Nobuhiro Iwamatsu) Xilinx NWL PCIe controller driver: - Enable PCIe reference clock via CCF (Hyun Kwon) Miscellaneous: - Convert sta2x11 from 'pci_' to 'dma_' API (Christophe JAILLET) - Fix pci_dev_str_match_path() alloc while atomic bug (used for kernel parameters that specify devices) (Dan Carpenter) - Remove pointless Precision Time Management warning when PTM is present but not enabled (Jakub Kicinski) - Remove surplus "break" statements (Krzysztof Wilczyński)" * tag 'pci-v5.15-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci: (132 commits) PCI: ibmphp: Fix double unmap of io_mem x86/PCI: sta2x11: switch from 'pci_' to 'dma_' API PCI/VPD: Use unaligned access helpers PCI/VPD: Clean up public VPD defines and inline functions cxgb4: Use pci_vpd_find_id_string() to find VPD ID string PCI/VPD: Add pci_vpd_find_id_string() PCI/VPD: Include post-processing in pci_vpd_find_tag() PCI/VPD: Stop exporting pci_vpd_find_info_keyword() PCI/VPD: Stop exporting pci_vpd_find_tag() PCI: Set dma-can-stall for HiSilicon chips PCI: rockchip-dwc: Add Rockchip RK356X host controller driver PCI: dwc: Remove surplus break statement after return PCI: artpec6: Remove local code block from switch statement PCI: artpec6: Remove surplus break statement after return MAINTAINERS: Add entries for Toshiba Visconti PCIe controller PCI: visconti: Add Toshiba Visconti PCIe host controller driver PCI/portdrv: Enable Bandwidth Notification only if port supports it PCI: Allow PASID on fake PCIe devices without TLP prefixes PCI: mediatek: Use PCI domain to handle ports detection PCI: mediatek: Add new method to get irq number ...
2021-09-07kbuild: Only default to -Werror if COMPILE_TESTMarco Elver1-1/+1
The cross-product of the kernel's supported toolchains, architectures, and configuration options is large. So large, that it's generally accepted to be infeasible to enumerate and build+test them all (many compile-testers rely on randomly generated configs). Without the possibility to enumerate all possible combinations of toolchains, architectures, and configuration options, it is inevitable that compiler warnings in this space exist. With -Werror, this means that an innumerable set of kernels are now broken, yet had been perfectly usable before (confused compilers, code with warnings unused, or luck). Distributors will necessarily pick a point in the toolchain X arch X config space, and if unlucky, will have a broken build. Granted, those will likely disable CONFIG_WERROR and move on. The kernel's default configuration is unlikely to be suitable for all users, but it's inappropriate to force many users to set CONFIG_WERROR=n. This also holds for CI systems which are focused on runtime testing, where the odd warning in some subsystem will disrupt testing of the rest of the kernel. Many of those runtime-focused CI systems run tests or fuzz the kernel using runtime debugging tools. Runtime testing of different subsystems can proceed in parallel, and potentially uncover serious bugs; halting runtime testing of the entire kernel because of the odd warning (now error) in a subsystem or driver is simply inappropriate. Therefore, runtime-focused CI systems will likely choose CONFIG_WERROR=n as well. The appropriate usecase for -Werror is therefore compile-test focused builds (often done by developers or CI systems). Reflect this in the Kconfig option by making the default value of WERROR match COMPILE_TEST. Signed-off-by: Marco Elver <[email protected]> Acked-by: Guenter Roeck <[email protected]> Acked-by: Randy Dunlap <[email protected]> Reviwed-by: Mark Brown <[email protected]> Reviewed-by: Nathan Chancellor <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2021-09-07tracing: Fix some alloc_event_probe() error handling bugsDan Carpenter1-2/+3
There are two bugs in this code. First, if the kzalloc() fails it leads to a NULL dereference of "ep" on the next line. Second, if the alloc_event_probe() function returns an error then it leads to an error pointer dereference in the caller. Link: https://lkml.kernel.org/r/20210824115150.GI31143@kili Fixes: 7491e2c44278 ("tracing: Add a probe that attaches to trace events") Signed-off-by: Dan Carpenter <[email protected]> Signed-off-by: Steven Rostedt (VMware) <[email protected]>