Age | Commit message (Collapse) | Author | Files | Lines |
|
To avoid the unbounded increase of the overhead, DAMON groups adjacent
pages that are assumed to have the same access frequencies into a
region. As long as the assumption (pages in a region have the same
access frequencies) is kept, only one page in the region is required to
be checked. Thus, for each ``sampling interval``,
1. the 'prepare_access_checks' primitive picks one page in each region,
2. waits for one ``sampling interval``,
3. checks whether the page is accessed meanwhile, and
4. increases the access count of the region if so.
Therefore, the monitoring overhead is controllable by adjusting the
number of regions. DAMON allows both the underlying primitives and user
callbacks to adjust regions for the trade-off. In other words, this
commit makes DAMON to use not only time-based sampling but also
space-based sampling.
This scheme, however, cannot preserve the quality of the output if the
assumption is not guaranteed. Next commit will address this problem.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: SeongJae Park <[email protected]>
Reviewed-by: Leonard Foerster <[email protected]>
Reviewed-by: Fernand Sieber <[email protected]>
Acked-by: Shakeel Butt <[email protected]>
Cc: Alexander Shishkin <[email protected]>
Cc: Amit Shah <[email protected]>
Cc: Benjamin Herrenschmidt <[email protected]>
Cc: Brendan Higgins <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: David Woodhouse <[email protected]>
Cc: Fan Du <[email protected]>
Cc: Greg Kroah-Hartman <[email protected]>
Cc: Greg Thelen <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Joe Perches <[email protected]>
Cc: Jonathan Cameron <[email protected]>
Cc: Jonathan Corbet <[email protected]>
Cc: Marco Elver <[email protected]>
Cc: Markus Boehme <[email protected]>
Cc: Maximilian Heyne <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Minchan Kim <[email protected]>
Cc: Namhyung Kim <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: Shuah Khan <[email protected]>
Cc: Steven Rostedt (VMware) <[email protected]>
Cc: Vladimir Davydov <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Patch series "Introduce Data Access MONitor (DAMON)", v34.
Introduction
============
DAMON is a data access monitoring framework for the Linux kernel. The
core mechanisms of DAMON called 'region based sampling' and 'adaptive
regions adjustment' (refer to 'mechanisms.rst' in the 11th patch of this
patchset for the detail) make it
- accurate (The monitored information is useful for DRAM level memory
management. It might not appropriate for Cache-level accuracy,
though.),
- light-weight (The monitoring overhead is low enough to be applied
online while making no impact on the performance of the target
workloads.), and
- scalable (the upper-bound of the instrumentation overhead is
controllable regardless of the size of target workloads.).
Using this framework, therefore, several memory management mechanisms such
as reclamation and THP can be optimized to aware real data access
patterns. Experimental access pattern aware memory management
optimization works that incurring high instrumentation overhead will be
able to have another try.
Though DAMON is for kernel subsystems, it can be easily exposed to the
user space by writing a DAMON-wrapper kernel subsystem. Then, user space
users who have some special workloads will be able to write personalized
tools or applications for deeper understanding and specialized
optimizations of their systems.
DAMON is also merged in two public Amazon Linux kernel trees that based on
v5.4.y[1] and v5.10.y[2].
[1] https://github.com/amazonlinux/linux/tree/amazon-5.4.y/master/mm/damon
[2] https://github.com/amazonlinux/linux/tree/amazon-5.10.y/master/mm/damon
The userspace tool[1] is available, released under GPLv2, and actively
being maintained. I am also planning to implement another basic user
interface in perf[2]. Also, the basic test suite for DAMON is available
under GPLv2[3].
[1] https://github.com/awslabs/damo
[2] https://lore.kernel.org/linux-mm/[email protected]/
[3] https://github.com/awslabs/damon-tests
Long-term Plan
--------------
DAMON is a part of a project called Data Access-aware Operating System
(DAOS). As the name implies, I want to improve the performance and
efficiency of systems using fine-grained data access patterns. The
optimizations are for both kernel and user spaces. I will therefore
modify or create kernel subsystems, export some of those to user space and
implement user space library / tools. Below shows the layers and
components for the project.
---------------------------------------------------------------------------
Primitives: PTE Accessed bit, PG_idle, rmap, (Intel CMT), ...
Framework: DAMON
Features: DAMOS, virtual addr, physical addr, ...
Applications: DAMON-debugfs, (DARC), ...
^^^^^^^^^^^^^^^^^^^^^^^ KERNEL SPACE ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Raw Interface: debugfs, (sysfs), (damonfs), tracepoints, (sys_damon), ...
vvvvvvvvvvvvvvvvvvvvvvv USER SPACE vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
Library: (libdamon), ...
Tools: DAMO, (perf), ...
---------------------------------------------------------------------------
The components in parentheses or marked as '...' are not implemented yet
but in the future plan. IOW, those are the TODO tasks of DAOS project.
For more detail, please refer to the plans:
https://lore.kernel.org/linux-mm/[email protected]/
Evaluations
===========
We evaluated DAMON's overhead, monitoring quality and usefulness using 24
realistic workloads on my QEMU/KVM based virtual machine running a kernel
that v24 DAMON patchset is applied.
DAMON is lightweight. It increases system memory usage by 0.39% and slows
target workloads down by 1.16%.
DAMON is accurate and useful for memory management optimizations. An
experimental DAMON-based operation scheme for THP, namely 'ethp', removes
76.15% of THP memory overheads while preserving 51.25% of THP speedup.
Another experimental DAMON-based 'proactive reclamation' implementation,
'prcl', reduces 93.38% of residential sets and 23.63% of system memory
footprint while incurring only 1.22% runtime overhead in the best case
(parsec3/freqmine).
NOTE that the experimental THP optimization and proactive reclamation are
not for production but only for proof of concepts.
Please refer to the official document[1] or "Documentation/admin-guide/mm:
Add a document for DAMON" patch in this patchset for detailed evaluation
setup and results.
[1] https://damonitor.github.io/doc/html/latest-damon/admin-guide/mm/damon/eval.html
Real-world User Story
=====================
In summary, DAMON has used on production systems and proved its usefulness.
DAMON as a profiler
-------------------
We analyzed characteristics of a large scale production systems of our
customers using DAMON. The systems utilize 70GB DRAM and 36 CPUs. From
this, we were able to find interesting things below.
There were obviously different access pattern under idle workload and
active workload. Under the idle workload, it accessed large memory
regions with low frequency, while the active workload accessed small
memory regions with high freuqnecy.
DAMON found a 7GB memory region that showing obviously high access
frequency under the active workload. We believe this is the
performance-effective working set and need to be protected.
There was a 4KB memory region that showing highest access frequency under
not only active but also idle workloads. We think this must be a hottest
code section like thing that should never be paged out.
For this analysis, DAMON used only 0.3-1% of single CPU time. Because we
used recording-based analysis, it consumed about 3-12 MB of disk space per
20 minutes. This is only small amount of disk space, but we can further
reduce the disk usage by using non-recording-based DAMON features. I'd
like to argue that only DAMON can do such detailed analysis (finding 4KB
highest region in 70GB memory) with the light overhead.
DAMON as a system optimization tool
-----------------------------------
We also found below potential performance problems on the systems and made
DAMON-based solutions.
The system doesn't want to make the workload suffer from the page
reclamation and thus it utilizes enough DRAM but no swap device. However,
we found the system is actively reclaiming file-backed pages, because the
system has intensive file IO. The file IO turned out to be not
performance critical for the workload, but the customer wanted to ensure
performance critical file-backed pages like code section to not mistakenly
be evicted.
Using direct IO should or `mlock()` would be a straightforward solution,
but modifying the user space code is not easy for the customer.
Alternatively, we could use DAMON-based operation scheme[1]. By using it,
we can ask DAMON to track access frequency of each region and make
'process_madvise(MADV_WILLNEED)[2]' call for regions having specific size
and access frequency for a time interval.
We also found the system is having high number of TLB misses. We tried
'always' THP enabled policy and it greatly reduced TLB misses, but the
page reclamation also been more frequent due to the THP internal
fragmentation caused memory bloat. We could try another DAMON-based
operation scheme that applies 'MADV_HUGEPAGE' to memory regions having
>=2MB size and high access frequency, while applying 'MADV_NOHUGEPAGE' to
regions having <2MB size and low access frequency.
We do not own the systems so we only reported the analysis results and
possible optimization solutions to the customers. The customers satisfied
about the analysis results and promised to try the optimization guides.
[1] https://lore.kernel.org/linux-mm/[email protected]/
[2] https://lore.kernel.org/linux-api/[email protected]/
Comparison with Idle Page Tracking
==================================
Idle Page Tracking allows users to set and read idleness of pages using a
bitmap file which represents each page with each bit of the file. One
recommended usage of it is working set size detection. Users can do that
by
1. find PFN of each page for workloads in interest,
2. set all the pages as idle by doing writes to the bitmap file,
3. wait until the workload accesses its working set, and
4. read the idleness of the pages again and count pages became not idle.
NOTE: While Idle Page Tracking is for user space users, DAMON is primarily
designed for kernel subsystems though it can easily exposed to the user
space. Hence, this section only assumes such user space use of DAMON.
For what use cases Idle Page Tracking would be better?
------------------------------------------------------
1. Flexible usecases other than hotness monitoring.
Because Idle Page Tracking allows users to control the primitive (Page
idleness) by themselves, Idle Page Tracking users can do anything they
want. Meanwhile, DAMON is primarily designed to monitor the hotness of
each memory region. For this, DAMON asks users to provide sampling
interval and aggregation interval. For the reason, there could be some
use case that using Idle Page Tracking is simpler.
2. Physical memory monitoring.
Idle Page Tracking receives PFN range as input, so natively supports
physical memory monitoring.
DAMON is designed to be extensible for multiple address spaces and use
cases by implementing and using primitives for the given use case.
Therefore, by theory, DAMON has no limitation in the type of target
address space as long as primitives for the given address space exists.
However, the default primitives introduced by this patchset supports only
virtual address spaces.
Therefore, for physical memory monitoring, you should implement your own
primitives and use it, or simply use Idle Page Tracking.
Nonetheless, RFC patchsets[1] for the physical memory address space
primitives is already available. It also supports user memory same to
Idle Page Tracking.
[1] https://lore.kernel.org/linux-mm/[email protected]/
For what use cases DAMON is better?
-----------------------------------
1. Hotness Monitoring.
Idle Page Tracking let users know only if a page frame is accessed or not.
For hotness check, the user should write more code and use more memory.
DAMON do that by itself.
2. Low Monitoring Overhead
DAMON receives user's monitoring request with one step and then provide
the results. So, roughly speaking, DAMON require only O(1) user/kernel
context switches.
In case of Idle Page Tracking, however, because the interface receives
contiguous page frames, the number of user/kernel context switches
increases as the monitoring target becomes complex and huge. As a result,
the context switch overhead could be not negligible.
Moreover, DAMON is born to handle with the monitoring overhead. Because
the core mechanism is pure logical, Idle Page Tracking users might be able
to implement the mechanism on their own, but it would be time consuming
and the user/kernel context switching will still more frequent than that
of DAMON. Also, the kernel subsystems cannot use the logic in this case.
3. Page granularity working set size detection.
Until v22 of this patchset, this was categorized as the thing Idle Page
Tracking could do better, because DAMON basically maintains additional
metadata for each of the monitoring target regions. So, in the page
granularity working set size detection use case, DAMON would incur (number
of monitoring target pages * size of metadata) memory overhead. Size of
the single metadata item is about 54 bytes, so assuming 4KB pages, about
1.3% of monitoring target pages will be additionally used.
All essential metadata for Idle Page Tracking are embedded in 'struct
page' and page table entries. Therefore, in this use case, only one
counter variable for working set size accounting is required if Idle Page
Tracking is used.
There are more details to consider, but roughly speaking, this is true in
most cases.
However, the situation changed from v23. Now DAMON supports arbitrary
types of monitoring targets, which don't use the metadata. Using that,
DAMON can do the working set size detection with no additional space
overhead but less user-kernel context switch. A first draft for the
implementation of monitoring primitives for this usage is available in a
DAMON development tree[1]. An RFC patchset for it based on this patchset
will also be available soon.
Since v24, the arbitrary type support is dropped from this patchset
because this patchset doesn't introduce real use of the type. You can
still get it from the DAMON development tree[2], though.
[1] https://github.com/sjp38/linux/tree/damon/pgidle_hack
[2] https://github.com/sjp38/linux/tree/damon/master
4. More future usecases
While Idle Page Tracking has tight coupling with base primitives (PG_Idle
and page table Accessed bits), DAMON is designed to be extensible for many
use cases and address spaces. If you need some special address type or
want to use special h/w access check primitives, you can write your own
primitives for that and configure DAMON to use those. Therefore, if your
use case could be changed a lot in future, using DAMON could be better.
Can I use both Idle Page Tracking and DAMON?
--------------------------------------------
Yes, though using them concurrently for overlapping memory regions could
result in interference to each other. Nevertheless, such use case would
be rare or makes no sense at all. Even in the case, the noise would bot
be really significant. So, you can choose whatever you want depending on
the characteristics of your use cases.
More Information
================
We prepared a showcase web site[1] that you can get more information.
There are
- the official documentations[2],
- the heatmap format dynamic access pattern of various realistic workloads for
heap area[3], mmap()-ed area[4], and stack[5] area,
- the dynamic working set size distribution[6] and chronological working set
size changes[7], and
- the latest performance test results[8].
[1] https://damonitor.github.io/_index
[2] https://damonitor.github.io/doc/html/latest-damon
[3] https://damonitor.github.io/test/result/visual/latest/rec.heatmap.0.png.html
[4] https://damonitor.github.io/test/result/visual/latest/rec.heatmap.1.png.html
[5] https://damonitor.github.io/test/result/visual/latest/rec.heatmap.2.png.html
[6] https://damonitor.github.io/test/result/visual/latest/rec.wss_sz.png.html
[7] https://damonitor.github.io/test/result/visual/latest/rec.wss_time.png.html
[8] https://damonitor.github.io/test/result/perf/latest/html/index.html
Baseline and Complete Git Trees
===============================
The patches are based on the latest -mm tree, specifically
v5.14-rc1-mmots-2021-07-15-18-47 of https://github.com/hnaz/linux-mm. You can
also clone the complete git tree:
$ git clone git://github.com/sjp38/linux -b damon/patches/v34
The web is also available:
https://github.com/sjp38/linux/releases/tag/damon/patches/v34
Development Trees
-----------------
There are a couple of trees for entire DAMON patchset series and features
for future release.
- For latest release: https://github.com/sjp38/linux/tree/damon/master
- For next release: https://github.com/sjp38/linux/tree/damon/next
Long-term Support Trees
-----------------------
For people who want to test DAMON but using LTS kernels, there are another
couple of trees based on two latest LTS kernels respectively and
containing the 'damon/master' backports.
- For v5.4.y: https://github.com/sjp38/linux/tree/damon/for-v5.4.y
- For v5.10.y: https://github.com/sjp38/linux/tree/damon/for-v5.10.y
Amazon Linux Kernel Trees
-------------------------
DAMON is also merged in two public Amazon Linux kernel trees that based on
v5.4.y[1] and v5.10.y[2].
[1] https://github.com/amazonlinux/linux/tree/amazon-5.4.y/master/mm/damon
[2] https://github.com/amazonlinux/linux/tree/amazon-5.10.y/master/mm/damon
Git Tree for Diff of Patches
============================
For easy review of diff between different versions of each patch, I
prepared a git tree containing all versions of the DAMON patchset series:
https://github.com/sjp38/damon-patches
You can clone it and use 'diff' for easy review of changes between
different versions of the patchset. For example:
$ git clone https://github.com/sjp38/damon-patches && cd damon-patches
$ diff -u damon/v33 damon/v34
Sequence Of Patches
===================
First three patches implement the core logics of DAMON. The 1st patch
introduces basic sampling based hotness monitoring for arbitrary types of
targets. Following two patches implement the core mechanisms for control
of overhead and accuracy, namely regions based sampling (patch 2) and
adaptive regions adjustment (patch 3).
Now the essential parts of DAMON is complete, but it cannot work unless
someone provides monitoring primitives for a specific use case. The
following two patches make it just work for virtual address spaces
monitoring. The 4th patch makes 'PG_idle' can be used by DAMON and the
5th patch implements the virtual memory address space specific monitoring
primitives using page table Accessed bits and the 'PG_idle' page flag.
Now DAMON just works for virtual address space monitoring via the kernel
space api. To let the user space users can use DAMON, following four
patches add interfaces for them. The 6th patch adds a tracepoint for
monitoring results. The 7th patch implements a DAMON application kernel
module, namely damon-dbgfs, that simply wraps DAMON and exposes DAMON
interface to the user space via the debugfs interface. The 8th patch
further exports pid of monitoring thread (kdamond) to user space for
easier cpu usage accounting, and the 9th patch makes the debugfs interface
to support multiple contexts.
Three patches for maintainability follows. The 10th patch adds
documentations for both the user space and the kernel space. The 11th
patch provides unit tests (based on the kunit) while the 12th patch adds
user space tests (based on the kselftest).
Finally, the last patch (13th) updates the MAINTAINERS file.
This patch (of 13):
DAMON is a data access monitoring framework for the Linux kernel. The
core mechanisms of DAMON make it
- accurate (the monitoring output is useful enough for DRAM level
performance-centric memory management; It might be inappropriate for
CPU cache levels, though),
- light-weight (the monitoring overhead is normally low enough to be
applied online), and
- scalable (the upper-bound of the overhead is in constant range
regardless of the size of target workloads).
Using this framework, hence, we can easily write efficient kernel space
data access monitoring applications. For example, the kernel's memory
management mechanisms can make advanced decisions using this.
Experimental data access aware optimization works that incurring high
access monitoring overhead could again be implemented on top of this.
Due to its simple and flexible interface, providing user space interface
would be also easy. Then, user space users who have some special
workloads can write personalized applications for better understanding and
optimizations of their workloads and systems.
===
Nevertheless, this commit is defining and implementing only basic access
check part without the overhead-accuracy handling core logic. The basic
access check is as below.
The output of DAMON says what memory regions are how frequently accessed
for a given duration. The resolution of the access frequency is
controlled by setting ``sampling interval`` and ``aggregation interval``.
In detail, DAMON checks access to each page per ``sampling interval`` and
aggregates the results. In other words, counts the number of the accesses
to each region. After each ``aggregation interval`` passes, DAMON calls
callback functions that previously registered by users so that users can
read the aggregated results and then clears the results. This can be
described in below simple pseudo-code::
init()
while monitoring_on:
for page in monitoring_target:
if accessed(page):
nr_accesses[page] += 1
if time() % aggregation_interval == 0:
for callback in user_registered_callbacks:
callback(monitoring_target, nr_accesses)
for page in monitoring_target:
nr_accesses[page] = 0
if time() % update_interval == 0:
update()
sleep(sampling interval)
The target regions constructed at the beginning of the monitoring and
updated after each ``regions_update_interval``, because the target regions
could be dynamically changed (e.g., mmap() or memory hotplug). The
monitoring overhead of this mechanism will arbitrarily increase as the
size of the target workload grows.
The basic monitoring primitives for actual access check and dynamic target
regions construction aren't in the core part of DAMON. Instead, it allows
users to implement their own primitives that are optimized for their use
case and configure DAMON to use those. In other words, users cannot use
current version of DAMON without some additional works.
Following commits will implement the core mechanisms for the
overhead-accuracy control and default primitives implementations.
Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: SeongJae Park <[email protected]>
Reviewed-by: Leonard Foerster <[email protected]>
Reviewed-by: Fernand Sieber <[email protected]>
Acked-by: Shakeel Butt <[email protected]>
Cc: Jonathan Cameron <[email protected]>
Cc: Alexander Shishkin <[email protected]>
Cc: Amit Shah <[email protected]>
Cc: Benjamin Herrenschmidt <[email protected]>
Cc: Jonathan Corbet <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: David Woodhouse <[email protected]>
Cc: Marco Elver <[email protected]>
Cc: Fan Du <[email protected]>
Cc: Greg Kroah-Hartman <[email protected]>
Cc: Greg Thelen <[email protected]>
Cc: Joe Perches <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Maximilian Heyne <[email protected]>
Cc: Minchan Kim <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Namhyung Kim <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: Steven Rostedt (VMware) <[email protected]>
Cc: Shuah Khan <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: Vladimir Davydov <[email protected]>
Cc: Brendan Higgins <[email protected]>
Cc: Markus Boehme <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Instead of hard-coding ((1UL << NR_PAGEFLAGS) - 1) everywhere, introducing
PAGEFLAGS_MASK to make the code clear to get the page flags.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Muchun Song <[email protected]>
Reviewed-by: Roman Gushchin <[email protected]>
Acked-by: Johannes Weiner <[email protected]>
Reviewed-by: Shakeel Butt <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Vladimir Davydov <[email protected]>
Cc: Matthew Wilcox (Oracle) <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
kmap_atomic() disables preemption and pagefaults for historical reasons.
The conversion to kmap_local(), which only disables migration, cannot be
done wholesale because quite some call sites need to be updated to
accommodate with the changed semantics.
On PREEMPT_RT enabled kernels the kmap_atomic() semantics are problematic
due to the implicit disabling of preemption which makes it impossible to
acquire 'sleeping' spinlocks within the kmap atomic sections.
PREEMPT_RT replaces the preempt_disable() with a migrate_disable() for
more than a decade. It could be argued that this is a justification to do
this unconditionally, but PREEMPT_RT covers only a limited number of
architectures and it disables some functionality which limits the coverage
further.
Limit the replacement to PREEMPT_RT for now.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Sebastian Andrzej Siewior <[email protected]>
Acked-by: Vlastimil Babka <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
early_ioremap_reset() reserved a weak function so that architectures can
provide a specific cleanup. Now no architectures use it, remove this
redundant function.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Weizhao Ouyang <[email protected]>
Reviewed-by: David Hildenbrand <[email protected]>
Cc: Arnd Bergmann <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Patch series "small ioremap cleanups".
The first patch moves a little code around the vmalloc/ioremap boundary
following a bigger move by Nick earlier. The second enforces
non-executable mapping on ioremap just like we do for vmap. No driver
currently uses executable mappings anyway, as they should.
This patch (of 2):
This keeps it together with the implementation, and to remove the
vmap_range wrapper.
Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Christoph Hellwig <[email protected]>
Reviewed-by: Nicholas Piggin <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
There is a READ_ONCE() in the macro of compound_head(), which will prevent
compiler from optimizing the code when there are more than once calling of
it in a function. Remove the redundant calling of compound_head() from
page_to_index() and page_add_file_rmap() for better code generation.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Muchun Song <[email protected]>
Reviewed-by: David Howells <[email protected]>
Cc: Matthew Wilcox (Oracle) <[email protected]>
Cc: William Kucharski <[email protected]>
Cc: Kirill A. Shutemov <[email protected]>
Cc: Johannes Weiner <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
policy
Currently, the "auto-movable" online policy does not allow for hotplugged
KERNEL (ZONE_NORMAL) memory to increase the amount of MOVABLE memory we
can have, primarily, because there is no coordiantion across memory
devices and we don't want to create zone-imbalances accidentially when
unplugging memory.
However, within a single memory device it's different. Let's allow for
KERNEL memory within a dynamic memory group to allow for more MOVABLE
within the same memory group. The only thing we have to take care of is
that the managing driver avoids zone imbalances by unplugging MOVABLE
memory first, otherwise there can be corner cases where unplug of memory
could result in (accidential) zone imbalances.
virtio-mem is the only user of dynamic memory groups and recently added
support for prioritizing unplug of ZONE_MOVABLE over ZONE_NORMAL, so we
don't need a new toggle to enable it for dynamic memory groups.
We limit this handling to dynamic memory groups, because:
* We want to keep the runtime overhead for collecting stats when
onlining a single memory block small. We tend to have only a handful of
dynamic memory groups, but we can have quite some static memory groups
(e.g., 256 DIMMs).
* It doesn't make too much sense for static memory groups, as we try
onlining all applicable memory blocks either completely to ZONE_MOVABLE
or not. In ordinary operation, we won't have a mixture of zones within
a static memory group.
When adding memory to a dynamic memory group, we'll first online memory to
ZONE_MOVABLE as long as early KERNEL memory allows for it. Then, we'll
online the next unit(s) to ZONE_NORMAL, until we can online the next
unit(s) to ZONE_MOVABLE.
For a simple virtio-mem device with a MOVABLE:KERNEL ratio of 3:1, it will
result in a layout like:
[M][M][M][M][M][M][M][M][N][M][M][M][N][M][M][M]...
^ movable memory due to early kernel memory
^ allows for more movable memory ...
^-----^ ... here
^ allows for more movable memory ...
^-----^ ... here
While the created layout is sub-optimal when it comes to contiguous zones,
it gives us the maximum flexibility when dynamically growing/shrinking a
device; we can grow small VMs really big in small steps, and still shrink
reliably to e.g., 1/4 of the maximum VM size in this example, removing
full memory blocks along with meta data more reliably.
Mark dynamic memory groups in the xarray such that we can efficiently
iterate over them when collecting stats. In usual setups, we have one
virtio-mem device per NUMA node, and usually only a small number of NUMA
nodes.
Note: for now, there seems to be no compelling reason to make this
behavior configurable.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: David Hildenbrand <[email protected]>
Cc: Anshuman Khandual <[email protected]>
Cc: Dan Williams <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Greg Kroah-Hartman <[email protected]>
Cc: Hui Zhu <[email protected]>
Cc: Jason Wang <[email protected]>
Cc: Len Brown <[email protected]>
Cc: Marek Kedzierski <[email protected]>
Cc: "Michael S. Tsirkin" <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Mike Rapoport <[email protected]>
Cc: Oscar Salvador <[email protected]>
Cc: Pankaj Gupta <[email protected]>
Cc: Pavel Tatashin <[email protected]>
Cc: Rafael J. Wysocki <[email protected]>
Cc: "Rafael J. Wysocki" <[email protected]>
Cc: Vitaly Kuznetsov <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: Wei Yang <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Use memory groups to improve our "auto-movable" onlining policy:
1. For static memory groups (e.g., a DIMM), online a memory block MOVABLE
only if all other memory blocks in the group are either MOVABLE or could
be onlined MOVABLE. A DIMM will either be MOVABLE or not, not a mixture.
2. For dynamic memory groups (e.g., a virtio-mem device), online a
memory block MOVABLE only if all other memory blocks inside the
current unit are either MOVABLE or could be onlined MOVABLE. For a
virtio-mem device with a device block size with 512 MiB, all 128 MiB
memory blocks wihin a 512 MiB unit will either be MOVABLE or not, not
a mixture.
We have to pass the memory group to zone_for_pfn_range() to take the
memory group into account.
Note: for now, there seems to be no compelling reason to make this
behavior configurable.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: David Hildenbrand <[email protected]>
Cc: Anshuman Khandual <[email protected]>
Cc: Dan Williams <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Greg Kroah-Hartman <[email protected]>
Cc: Hui Zhu <[email protected]>
Cc: Jason Wang <[email protected]>
Cc: Len Brown <[email protected]>
Cc: Marek Kedzierski <[email protected]>
Cc: "Michael S. Tsirkin" <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Mike Rapoport <[email protected]>
Cc: Oscar Salvador <[email protected]>
Cc: Pankaj Gupta <[email protected]>
Cc: Pavel Tatashin <[email protected]>
Cc: Rafael J. Wysocki <[email protected]>
Cc: "Rafael J. Wysocki" <[email protected]>
Cc: Vitaly Kuznetsov <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: Wei Yang <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Let's track all present pages in each memory group. Especially, track
memory present in ZONE_MOVABLE and memory present in one of the kernel
zones (which really only is ZONE_NORMAL right now as memory groups only
apply to hotplugged memory) separately within a memory group, to prepare
for making smart auto-online decision for individual memory blocks within
a memory group based on group statistics.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: David Hildenbrand <[email protected]>
Cc: Anshuman Khandual <[email protected]>
Cc: Dan Williams <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Greg Kroah-Hartman <[email protected]>
Cc: Hui Zhu <[email protected]>
Cc: Jason Wang <[email protected]>
Cc: Len Brown <[email protected]>
Cc: Marek Kedzierski <[email protected]>
Cc: "Michael S. Tsirkin" <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Mike Rapoport <[email protected]>
Cc: Oscar Salvador <[email protected]>
Cc: Pankaj Gupta <[email protected]>
Cc: Pavel Tatashin <[email protected]>
Cc: Rafael J. Wysocki <[email protected]>
Cc: "Rafael J. Wysocki" <[email protected]>
Cc: Vitaly Kuznetsov <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: Wei Yang <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
In our "auto-movable" memory onlining policy, we want to make decisions
across memory blocks of a single memory device. Examples of memory
devices include ACPI memory devices (in the simplest case a single DIMM)
and virtio-mem. For now, we don't have a connection between a single
memory block device and the real memory device. Each memory device
consists of 1..X memory block devices.
Let's logically group memory blocks belonging to the same memory device in
"memory groups". Memory groups can span multiple physical ranges and a
memory group itself does not contain any information regarding physical
ranges, only properties (e.g., "max_pages") necessary for improved memory
onlining.
Introduce two memory group types:
1) Static memory group: E.g., a single ACPI memory device, consisting
of 1..X memory resources. A memory group consists of 1..Y memory
blocks. The whole group is added/removed in one go. If any part
cannot get offlined, the whole group cannot be removed.
2) Dynamic memory group: E.g., a single virtio-mem device. Memory is
dynamically added/removed in a fixed granularity, called a "unit",
consisting of 1..X memory blocks. A unit is added/removed in one go.
If any part of a unit cannot get offlined, the whole unit cannot be
removed.
In case of 1) we usually want either all memory managed by ZONE_MOVABLE or
none. In case of 2) we usually want to have as many units as possible
managed by ZONE_MOVABLE. We want a single unit to be of the same type.
For now, memory groups are an internal concept that is not exposed to user
space; we might want to change that in the future, though.
add_memory() users can specify a mgid instead of a nid when passing the
MHP_NID_IS_MGID flag.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: David Hildenbrand <[email protected]>
Cc: Anshuman Khandual <[email protected]>
Cc: Dan Williams <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Greg Kroah-Hartman <[email protected]>
Cc: Hui Zhu <[email protected]>
Cc: Jason Wang <[email protected]>
Cc: Len Brown <[email protected]>
Cc: Marek Kedzierski <[email protected]>
Cc: "Michael S. Tsirkin" <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Mike Rapoport <[email protected]>
Cc: Oscar Salvador <[email protected]>
Cc: Pankaj Gupta <[email protected]>
Cc: Pavel Tatashin <[email protected]>
Cc: Rafael J. Wysocki <[email protected]>
Cc: "Rafael J. Wysocki" <[email protected]>
Cc: Vitaly Kuznetsov <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: Wei Yang <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Patch series "mm/memory_hotplug: "auto-movable" online policy and memory groups", v3.
I. Goal
The goal of this series is improving in-kernel auto-online support. It
tackles the fundamental problems that:
1) We can create zone imbalances when onlining all memory blindly to
ZONE_MOVABLE, in the worst case crashing the system. We have to know
upfront how much memory we are going to hotplug such that we can
safely enable auto-onlining of all hotplugged memory to ZONE_MOVABLE
via "online_movable". This is far from practical and only applicable in
limited setups -- like inside VMs under the RHV/oVirt hypervisor which
will never hotplug more than 3 times the boot memory (and the
limitation is only in place due to the Linux limitation).
2) We see more setups that implement dynamic VM resizing, hot(un)plugging
memory to resize VM memory. In these setups, we might hotplug a lot of
memory, but it might happen in various small steps in both directions
(e.g., 2 GiB -> 8 GiB -> 4 GiB -> 16 GiB ...). virtio-mem is the
primary driver of this upstream right now, performing such dynamic
resizing NUMA-aware via multiple virtio-mem devices.
Onlining all hotplugged memory to ZONE_NORMAL means we basically have
no hotunplug guarantees. Onlining all to ZONE_MOVABLE means we can
easily run into zone imbalances when growing a VM. We want a mixture,
and we want as much memory as reasonable/configured in ZONE_MOVABLE.
Details regarding zone imbalances can be found at [1].
3) Memory devices consist of 1..X memory block devices, however, the
kernel doesn't really track the relationship. Consequently, also user
space has no idea. We want to make per-device decisions.
As one example, for memory hotunplug it doesn't make sense to use a
mixture of zones within a single DIMM: we want all MOVABLE if
possible, otherwise all !MOVABLE, because any !MOVABLE part will easily
block the whole DIMM from getting hotunplugged.
As another example, virtio-mem operates on individual units that span
1..X memory blocks. Similar to a DIMM, we want a unit to either be all
MOVABLE or !MOVABLE. A "unit" can be thought of like a DIMM, however,
all units of a virtio-mem device logically belong together and are
managed (added/removed) by a single driver. We want as much memory of
a virtio-mem device to be MOVABLE as possible.
4) We want memory onlining to be done right from the kernel while adding
memory, not triggered by user space via udev rules; for example, this
is reqired for fast memory hotplug for drivers that add individual
memory blocks, like virito-mem. We want a way to configure a policy in
the kernel and avoid implementing advanced policies in user space.
The auto-onlining support we have in the kernel is not sufficient. All we
have is a) online everything MOVABLE (online_movable) b) online everything
!MOVABLE (online_kernel) c) keep zones contiguous (online). This series
allows configuring c) to mean instead "online movable if possible
according to the coniguration, driven by a maximum MOVABLE:KERNEL ratio"
-- a new onlining policy.
II. Approach
This series does 3 things:
1) Introduces the "auto-movable" online policy that initially operates on
individual memory blocks only. It uses a maximum MOVABLE:KERNEL ratio
to make a decision whether a memory block will be onlined to
ZONE_MOVABLE or not. However, in the basic form, hotplugged KERNEL
memory does not allow for more MOVABLE memory (details in the
patches). CMA memory is treated like MOVABLE memory.
2) Introduces static (e.g., DIMM) and dynamic (e.g., virtio-mem) memory
groups and uses group information to make decisions in the
"auto-movable" online policy across memory blocks of a single memory
device (modeled as memory group). More details can be found in patch
#3 or in the DIMM example below.
3) Maximizes ZONE_MOVABLE memory within dynamic memory groups, by
allowing ZONE_NORMAL memory within a dynamic memory group to allow for
more ZONE_MOVABLE memory within the same memory group. The target use
case is dynamic VM resizing using virtio-mem. See the virtio-mem
example below.
I remember that the basic idea of using a ratio to implement a policy in
the kernel was once mentioned by Vitaly Kuznetsov, but I might be wrong (I
lost the pointer to that discussion).
For me, the main use case is using it along with virtio-mem (and DIMMs /
ppc64 dlpar where necessary) for dynamic resizing of VMs, increasing the
amount of memory we can hotunplug reliably again if we might eventually
hotplug a lot of memory to a VM.
III. Target Usage
The target usage will be:
1) Linux boots with "mhp_default_online_type=offline"
2) User space (e.g., systemd unit) configures memory onlining (according
to a config file and system properties), for example:
* Setting memory_hotplug.online_policy=auto-movable
* Setting memory_hotplug.auto_movable_ratio=301
* Setting memory_hotplug.auto_movable_numa_aware=true
3) User space enabled auto onlining via "echo online >
/sys/devices/system/memory/auto_online_blocks"
4) User space triggers manual onlining of all already-offline memory
blocks (go over offline memory blocks and set them to "online")
IV. Example
For DIMMs, hotplugging 4 GiB DIMMs to a 4 GiB VM with a configured ratio of
301% results in the following layout:
Memory block 0-15: DMA32 (early)
Memory block 32-47: Normal (early)
Memory block 48-79: Movable (DIMM 0)
Memory block 80-111: Movable (DIMM 1)
Memory block 112-143: Movable (DIMM 2)
Memory block 144-275: Normal (DIMM 3)
Memory block 176-207: Normal (DIMM 4)
... all Normal
(-> hotplugged Normal memory does not allow for more Movable memory)
For virtio-mem, using a simple, single virtio-mem device with a 4 GiB VM
will result in the following layout:
Memory block 0-15: DMA32 (early)
Memory block 32-47: Normal (early)
Memory block 48-143: Movable (virtio-mem, first 12 GiB)
Memory block 144: Normal (virtio-mem, next 128 MiB)
Memory block 145-147: Movable (virtio-mem, next 384 MiB)
Memory block 148: Normal (virtio-mem, next 128 MiB)
Memory block 149-151: Movable (virtio-mem, next 384 MiB)
... Normal/Movable mixture as above
(-> hotplugged Normal memory allows for more Movable memory within
the same device)
Which gives us maximum flexibility when dynamically growing/shrinking a
VM in smaller steps.
V. Doc Update
I'll update the memory-hotplug.rst documentation, once the overhaul [1] is
usptream. Until then, details can be found in patch #2.
VI. Future Work
1) Use memory groups for ppc64 dlpar
2) Being able to specify a portion of (early) kernel memory that will be
excluded from the ratio. Like "128 MiB globally/per node" are excluded.
This might be helpful when starting VMs with extremely small memory
footprint (e.g., 128 MiB) and hotplugging memory later -- not wanting
the first hotplugged units getting onlined to ZONE_MOVABLE. One
alternative would be a trigger to not consider ZONE_DMA memory
in the ratio. We'll have to see if this is really rrequired.
3) Indicate to user space that MOVABLE might be a bad idea -- especially
relevant when memory ballooning without support for balloon compaction
is active.
This patch (of 9):
For implementing a new memory onlining policy, which determines when to
online memory blocks to ZONE_MOVABLE semi-automatically, we need the
number of present early (boot) pages -- present pages excluding hotplugged
pages. Let's track these pages per zone.
Pass a page instead of the zone to adjust_present_page_count(), similar as
adjust_managed_page_count() and derive the zone from the page.
It's worth noting that a memory block to be offlined/onlined is either
completely "early" or "not early". add_memory() and friends can only add
complete memory blocks and we only online/offline complete (individual)
memory blocks.
Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: David Hildenbrand <[email protected]>
Cc: Vitaly Kuznetsov <[email protected]>
Cc: "Michael S. Tsirkin" <[email protected]>
Cc: Jason Wang <[email protected]>
Cc: Marek Kedzierski <[email protected]>
Cc: Hui Zhu <[email protected]>
Cc: Pankaj Gupta <[email protected]>
Cc: Wei Yang <[email protected]>
Cc: Oscar Salvador <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Dan Williams <[email protected]>
Cc: Anshuman Khandual <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: Mike Rapoport <[email protected]>
Cc: "Rafael J. Wysocki" <[email protected]>
Cc: Len Brown <[email protected]>
Cc: Pavel Tatashin <[email protected]>
Cc: Greg Kroah-Hartman <[email protected]>
Cc: Rafael J. Wysocki <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
There is only a single user remaining. We can simply lookup the nid only
used for node offlining purposes when walking our memory blocks. We don't
expect to remove multi-nid ranges; and if we'd ever do, we most probably
don't care about removing multi-nid ranges that actually result in empty
nodes.
If ever required, we can detect the "multi-nid" scenario and simply try
offlining all online nodes.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: David Hildenbrand <[email protected]>
Acked-by: Michael Ellerman <[email protected]> (powerpc)
Cc: Michael Ellerman <[email protected]>
Cc: Benjamin Herrenschmidt <[email protected]>
Cc: Paul Mackerras <[email protected]>
Cc: "Rafael J. Wysocki" <[email protected]>
Cc: Len Brown <[email protected]>
Cc: Dan Williams <[email protected]>
Cc: Vishal Verma <[email protected]>
Cc: Dave Jiang <[email protected]>
Cc: "Michael S. Tsirkin" <[email protected]>
Cc: Jason Wang <[email protected]>
Cc: Nathan Lynch <[email protected]>
Cc: Laurent Dufour <[email protected]>
Cc: "Aneesh Kumar K.V" <[email protected]>
Cc: Scott Cheloha <[email protected]>
Cc: Anton Blanchard <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Anshuman Khandual <[email protected]>
Cc: Ard Biesheuvel <[email protected]>
Cc: Baoquan He <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Catalin Marinas <[email protected]>
Cc: Christian Borntraeger <[email protected]>
Cc: Christophe Leroy <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Heiko Carstens <[email protected]>
Cc: "H. Peter Anvin" <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Jia He <[email protected]>
Cc: Joe Perches <[email protected]>
Cc: Kefeng Wang <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Michel Lespinasse <[email protected]>
Cc: Mike Rapoport <[email protected]>
Cc: Nicholas Piggin <[email protected]>
Cc: Oscar Salvador <[email protected]>
Cc: Pankaj Gupta <[email protected]>
Cc: Pankaj Gupta <[email protected]>
Cc: Pavel Tatashin <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Pierre Morel <[email protected]>
Cc: "Rafael J. Wysocki" <[email protected]>
Cc: Rich Felker <[email protected]>
Cc: Sergei Trofimovich <[email protected]>
Cc: Thiago Jung Bauermann <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Vasily Gorbik <[email protected]>
Cc: Vitaly Kuznetsov <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: Wei Yang <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: Yoshinori Sato <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
The parameter is unused, let's remove it.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: David Hildenbrand <[email protected]>
Acked-by: Catalin Marinas <[email protected]>
Acked-by: Michael Ellerman <[email protected]> [powerpc]
Acked-by: Heiko Carstens <[email protected]> [s390]
Reviewed-by: Pankaj Gupta <[email protected]>
Reviewed-by: Oscar Salvador <[email protected]>
Cc: Catalin Marinas <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: Michael Ellerman <[email protected]>
Cc: Benjamin Herrenschmidt <[email protected]>
Cc: Paul Mackerras <[email protected]>
Cc: Heiko Carstens <[email protected]>
Cc: Vasily Gorbik <[email protected]>
Cc: Christian Borntraeger <[email protected]>
Cc: Yoshinori Sato <[email protected]>
Cc: Rich Felker <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: "H. Peter Anvin" <[email protected]>
Cc: Anshuman Khandual <[email protected]>
Cc: Ard Biesheuvel <[email protected]>
Cc: Mike Rapoport <[email protected]>
Cc: Nicholas Piggin <[email protected]>
Cc: Pavel Tatashin <[email protected]>
Cc: Baoquan He <[email protected]>
Cc: Laurent Dufour <[email protected]>
Cc: Sergei Trofimovich <[email protected]>
Cc: Kefeng Wang <[email protected]>
Cc: Michel Lespinasse <[email protected]>
Cc: Christophe Leroy <[email protected]>
Cc: "Aneesh Kumar K.V" <[email protected]>
Cc: Thiago Jung Bauermann <[email protected]>
Cc: Joe Perches <[email protected]>
Cc: Pierre Morel <[email protected]>
Cc: Jia He <[email protected]>
Cc: Anton Blanchard <[email protected]>
Cc: Dan Williams <[email protected]>
Cc: Dave Jiang <[email protected]>
Cc: Jason Wang <[email protected]>
Cc: Len Brown <[email protected]>
Cc: "Michael S. Tsirkin" <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Nathan Lynch <[email protected]>
Cc: Pankaj Gupta <[email protected]>
Cc: "Rafael J. Wysocki" <[email protected]>
Cc: "Rafael J. Wysocki" <[email protected]>
Cc: Scott Cheloha <[email protected]>
Cc: Vishal Verma <[email protected]>
Cc: Vitaly Kuznetsov <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: Wei Yang <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Patch series "mm/memory_hotplug: preparatory patches for new online policy and memory"
These are all cleanups and one fix previously sent as part of [1]:
[PATCH v1 00/12] mm/memory_hotplug: "auto-movable" online policy and memory
groups.
These patches make sense even without the other series, therefore I pulled
them out to make the other series easier to digest.
[1] https://lkml.kernel.org/r/[email protected]
This patch (of 4):
Checkpatch complained on a follow-up patch that we are using "unsigned"
here, which defaults to "unsigned int" and checkpatch is correct.
As we will search for a fitting zone using the wrong pfn, we might end
up onlining memory to one of the special kernel zones, such as ZONE_DMA,
which can end badly as the onlined memory does not satisfy properties of
these zones.
Use "unsigned long" instead, just as we do in other places when handling
PFNs. This can bite us once we have physical addresses in the range of
multiple TB.
Link: https://lkml.kernel.org/r/[email protected]
Fixes: e5e689302633 ("mm, memory_hotplug: display allowed zones in the preferred ordering")
Signed-off-by: David Hildenbrand <[email protected]>
Reviewed-by: Pankaj Gupta <[email protected]>
Reviewed-by: Muchun Song <[email protected]>
Reviewed-by: Oscar Salvador <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: Vitaly Kuznetsov <[email protected]>
Cc: "Michael S. Tsirkin" <[email protected]>
Cc: Jason Wang <[email protected]>
Cc: Pankaj Gupta <[email protected]>
Cc: Wei Yang <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Dan Williams <[email protected]>
Cc: Anshuman Khandual <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: Mike Rapoport <[email protected]>
Cc: "Rafael J. Wysocki" <[email protected]>
Cc: Len Brown <[email protected]>
Cc: Pavel Tatashin <[email protected]>
Cc: Heiko Carstens <[email protected]>
Cc: Michael Ellerman <[email protected]>
Cc: Catalin Marinas <[email protected]>
Cc: [email protected]
Cc: Andy Lutomirski <[email protected]>
Cc: "Aneesh Kumar K.V" <[email protected]>
Cc: Anton Blanchard <[email protected]>
Cc: Ard Biesheuvel <[email protected]>
Cc: Baoquan He <[email protected]>
Cc: Benjamin Herrenschmidt <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Christian Borntraeger <[email protected]>
Cc: Christophe Leroy <[email protected]>
Cc: Dave Jiang <[email protected]>
Cc: "H. Peter Anvin" <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Jia He <[email protected]>
Cc: Joe Perches <[email protected]>
Cc: Kefeng Wang <[email protected]>
Cc: Laurent Dufour <[email protected]>
Cc: Michel Lespinasse <[email protected]>
Cc: Nathan Lynch <[email protected]>
Cc: Nicholas Piggin <[email protected]>
Cc: Paul Mackerras <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Pierre Morel <[email protected]>
Cc: "Rafael J. Wysocki" <[email protected]>
Cc: Rich Felker <[email protected]>
Cc: Scott Cheloha <[email protected]>
Cc: Sergei Trofimovich <[email protected]>
Cc: Thiago Jung Bauermann <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Vasily Gorbik <[email protected]>
Cc: Vishal Verma <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: Yoshinori Sato <[email protected]>
Cc: <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Patch series "mm: remove pfn_valid_within() and CONFIG_HOLES_IN_ZONE".
After recent updates to freeing unused parts of the memory map, no
architecture can have holes in the memory map within a pageblock. This
makes pfn_valid_within() check and CONFIG_HOLES_IN_ZONE configuration
option redundant.
The first patch removes them both in a mechanical way and the second patch
simplifies memory_hotplug::test_pages_in_a_zone() that had
pfn_valid_within() surrounded by more logic than simple if.
This patch (of 2):
After recent changes in freeing of the unused parts of the memory map and
rework of pfn_valid() in arm and arm64 there are no architectures that can
have holes in the memory map within a pageblock and so nothing can enable
CONFIG_HOLES_IN_ZONE which guards non trivial implementation of
pfn_valid_within().
With that, pfn_valid_within() is always hardwired to 1 and can be
completely removed.
Remove calls to pfn_valid_within() and CONFIG_HOLES_IN_ZONE.
Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Mike Rapoport <[email protected]>
Acked-by: David Hildenbrand <[email protected]>
Cc: Greg Kroah-Hartman <[email protected]>
Cc: "Rafael J. Wysocki" <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
timespec64_ns() prevents multiplication overflows by comparing the seconds
value of the timespec to KTIME_SEC_MAX. If the value is greater or equal it
returns KTIME_MAX.
But that check casts the signed seconds value to unsigned which makes the
comparision true for all negative values and therefore return wrongly
KTIME_MAX.
Negative second values are perfectly valid and required in some places,
e.g. ptp_clock_adjtime().
Remove the cast and add a check for the negative boundary which is required
to prevent undefined behaviour due to multiplication underflow.
Fixes: cb47755725da ("time: Prevent undefined behaviour in timespec64_to_ns()")'
Signed-off-by: Lukas Hannen <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Cc: [email protected]
Link: https://lore.kernel.org/r/AM6PR01MB541637BD6F336B8FFB72AF80EEC69@AM6PR01MB5416.eurprd01.prod.exchangelabs.com
|
|
* pm-cpufreq:
Revert "cpufreq: intel_pstate: Process HWP Guaranteed change notification"
cpufreq: mediatek-hw: Add support for CPUFREQ HW
cpufreq: Add of_perf_domain_get_sharing_cpumask
dt-bindings: cpufreq: add bindings for MediaTek cpufreq HW
cpufreq: Remove ready() callback
cpufreq: sh: Remove sh_cpufreq_cpu_ready()
cpufreq: acpi: Remove acpi_cpufreq_cpu_ready()
cpufreq: qcom-hw: Set dvfs_possible_from_any_cpu cpufreq driver flag
cpufreq: blocklist more Qualcomm platforms in cpufreq-dt-platdev
cpufreq: qcom-cpufreq-hw: Add dcvs interrupt support
cpufreq: scmi: Use .register_em() to register with energy model
cpufreq: vexpress: Use .register_em() to register with energy model
cpufreq: scpi: Use .register_em() to register with energy model
cpufreq: qcom-cpufreq-hw: Use .register_em() to register with energy model
cpufreq: omap: Use .register_em() to register with energy model
cpufreq: mediatek: Use .register_em() to register with energy model
cpufreq: imx6q: Use .register_em() to register with energy model
cpufreq: dt: Use .register_em() to register with energy model
cpufreq: Add callback to register with energy model
cpufreq: vexpress: Set CPUFREQ_IS_COOLING_DEV flag
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci
Pull PCI updates from Bjorn Helgaas:
"Enumeration:
- Convert controller drivers to generic_handle_domain_irq() (Marc
Zyngier)
- Simplify VPD (Vital Product Data) access and search (Heiner
Kallweit)
- Update bnx2, bnx2x, bnxt, cxgb4, cxlflash, sfc, tg3 drivers to use
simplified VPD interfaces (Heiner Kallweit)
- Run Max Payload Size quirks before configuring MPS; work around
ASMedia ASM1062 SATA MPS issue (Marek Behún)
Resource management:
- Refactor pci_ioremap_bar() and pci_ioremap_wc_bar() (Krzysztof
Wilczyński)
- Optimize pci_resource_len() to reduce kernel size (Zhen Lei)
PCI device hotplug:
- Fix a double unmap in ibmphp (Vishal Aslot)
PCIe port driver:
- Enable Bandwidth Notification only if port supports it (Stuart
Hayes)
Sysfs/proc/syscalls:
- Add schedule point in proc_bus_pci_read() (Krzysztof Wilczyński)
- Return ~0 data on pciconfig_read() CAP_SYS_ADMIN failure (Krzysztof
Wilczyński)
- Return "int" from pciconfig_read() syscall (Krzysztof Wilczyński)
Virtualization:
- Extend "pci=noats" to also turn on Translation Blocking to protect
against some DMA attacks (Alex Williamson)
- Add sysfs mechanism to control the type of reset used between
device assignments to VMs (Amey Narkhede)
- Add support for ACPI _RST reset method (Shanker Donthineni)
- Add ACS quirks for Cavium multi-function devices (George Cherian)
- Add ACS quirks for NXP LX2xx0 and LX2xx2 platforms (Wasim Khan)
- Allow HiSilicon AMBA devices that appear as fake PCI devices to use
PASID and SVA (Zhangfei Gao)
Endpoint framework:
- Add support for SR-IOV Endpoint devices (Kishon Vijay Abraham I)
- Zero-initialize endpoint test tool parameters so we don't use
random parameters (Shunyong Yang)
APM X-Gene PCIe controller driver:
- Remove redundant dev_err() call in xgene_msi_probe() (ErKun Yang)
Broadcom iProc PCIe controller driver:
- Don't fail devm_pci_alloc_host_bridge() on missing 'ranges' because
it's optional on BCMA devices (Rob Herring)
- Fix BCMA probe resource handling (Rob Herring)
Cadence PCIe driver:
- Work around J7200 Link training electrical issue by increasing
delays in LTSSM (Nadeem Athani)
Intel IXP4xx PCI controller driver:
- Depend on ARCH_IXP4XX to avoid useless config questions (Geert
Uytterhoeven)
Intel Keembay PCIe controller driver:
- Add Intel Keem Bay PCIe controller (Srikanth Thokala)
Marvell Aardvark PCIe controller driver:
- Work around config space completion handling issues (Evan Wang)
- Increase timeout for config access completions (Pali Rohár)
- Emulate CRS Software Visibility bit (Pali Rohár)
- Configure resources from DT 'ranges' property to fix I/O space
access (Pali Rohár)
- Serialize INTx mask/unmask (Pali Rohár)
MediaTek PCIe controller driver:
- Add MT7629 support in DT (Chuanjia Liu)
- Fix an MSI issue (Chuanjia Liu)
- Get syscon regmap ("mediatek,generic-pciecfg"), IRQ number
("pci_irq"), PCI domain ("linux,pci-domain") from DT properties if
present (Chuanjia Liu)
Microsoft Hyper-V host bridge driver:
- Add ARM64 support (Boqun Feng)
- Support "Create Interrupt v3" message (Sunil Muthuswamy)
NVIDIA Tegra PCIe controller driver:
- Use seq_puts(), move err_msg from stack to static, fix OF node leak
(Christophe JAILLET)
NVIDIA Tegra194 PCIe driver:
- Disable suspend when in Endpoint mode (Om Prakash Singh)
- Fix MSI-X address programming error (Om Prakash Singh)
- Disable interrupts during suspend to avoid spurious AER link down
(Om Prakash Singh)
Renesas R-Car PCIe controller driver:
- Work around hardware issue that prevents Link L1->L0 transition
(Marek Vasut)
- Fix runtime PM refcount leak (Dinghao Liu)
Rockchip DesignWare PCIe controller driver:
- Add Rockchip RK356X host controller driver (Simon Xue)
TI J721E PCIe driver:
- Add support for J7200 and AM64 (Kishon Vijay Abraham I)
Toshiba Visconti PCIe controller driver:
- Add Toshiba Visconti PCIe host controller driver (Nobuhiro
Iwamatsu)
Xilinx NWL PCIe controller driver:
- Enable PCIe reference clock via CCF (Hyun Kwon)
Miscellaneous:
- Convert sta2x11 from 'pci_' to 'dma_' API (Christophe JAILLET)
- Fix pci_dev_str_match_path() alloc while atomic bug (used for
kernel parameters that specify devices) (Dan Carpenter)
- Remove pointless Precision Time Management warning when PTM is
present but not enabled (Jakub Kicinski)
- Remove surplus "break" statements (Krzysztof Wilczyński)"
* tag 'pci-v5.15-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci: (132 commits)
PCI: ibmphp: Fix double unmap of io_mem
x86/PCI: sta2x11: switch from 'pci_' to 'dma_' API
PCI/VPD: Use unaligned access helpers
PCI/VPD: Clean up public VPD defines and inline functions
cxgb4: Use pci_vpd_find_id_string() to find VPD ID string
PCI/VPD: Add pci_vpd_find_id_string()
PCI/VPD: Include post-processing in pci_vpd_find_tag()
PCI/VPD: Stop exporting pci_vpd_find_info_keyword()
PCI/VPD: Stop exporting pci_vpd_find_tag()
PCI: Set dma-can-stall for HiSilicon chips
PCI: rockchip-dwc: Add Rockchip RK356X host controller driver
PCI: dwc: Remove surplus break statement after return
PCI: artpec6: Remove local code block from switch statement
PCI: artpec6: Remove surplus break statement after return
MAINTAINERS: Add entries for Toshiba Visconti PCIe controller
PCI: visconti: Add Toshiba Visconti PCIe host controller driver
PCI/portdrv: Enable Bandwidth Notification only if port supports it
PCI: Allow PASID on fake PCIe devices without TLP prefixes
PCI: mediatek: Use PCI domain to handle ports detection
PCI: mediatek: Add new method to get irq number
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Pull networking fixes and stragglers from Jakub Kicinski:
"Networking stragglers and fixes, including changes from netfilter,
wireless and can.
Current release - regressions:
- qrtr: revert check in qrtr_endpoint_post(), fixes audio and wifi
- ip_gre: validate csum_start only on pull
- bnxt_en: fix 64-bit doorbell operation on 32-bit kernels
- ionic: fix double use of queue-lock, fix a sleeping in atomic
- can: c_can: fix null-ptr-deref on ioctl()
- cs89x0: disable compile testing on powerpc
Current release - new code bugs:
- bridge: mcast: fix vlan port router deadlock, consistently disable
BH
Previous releases - regressions:
- dsa: tag_rtl4_a: fix egress tags, only port 0 was working
- mptcp: fix possible divide by zero
- netfilter: nft_ct: protect nft_ct_pcpu_template_refcnt with mutex
- netfilter: socket: icmp6: fix use-after-scope
- stmmac: fix MAC not working when system resume back with WoL active
Previous releases - always broken:
- ip/ip6_gre: use the same logic as SIT interfaces when computing
v6LL address
- seg6: set fc_nlinfo in nh_create_ipv4, nh_create_ipv6
- mptcp: only send extra TCP acks in eligible socket states
- dsa: lantiq_gswip: fix maximum frame length
- stmmac: fix overall budget calculation for rxtx_napi
- bnxt_en: fix firmware version reporting via devlink
- renesas: sh_eth: add missing barrier to fix freeing wrong tx
descriptor
Stragglers:
- netfilter: conntrack: switch to siphash
- netfilter: refuse insertion if chain has grown too large
- ncsi: add get MAC address command to get Intel i210 MAC address"
* tag 'net-5.15-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (76 commits)
ieee802154: Remove redundant initialization of variable ret
net: stmmac: fix MAC not working when system resume back with WoL active
net: phylink: add suspend/resume support
net: renesas: sh_eth: Fix freeing wrong tx descriptor
bonding: 3ad: pass parameter bond_params by reference
cxgb3: fix oops on module removal
can: c_can: fix null-ptr-deref on ioctl()
can: rcar_canfd: add __maybe_unused annotation to silence warning
net: wwan: iosm: Unify IO accessors used in the driver
net: wwan: iosm: Replace io.*64_lo_hi() with regular accessors
net: qcom/emac: Replace strlcpy with strscpy
ip6_gre: Revert "ip6_gre: add validation for csum_start"
net: hns3: make hclgevf_cmd_caps_bit_map0 and hclge_cmd_caps_bit_map0 static
selftests/bpf: Test XDP bonding nest and unwind
bonding: Fix negative jump label count on nested bonding
MAINTAINERS: add VM SOCKETS (AF_VSOCK) entry
stmmac: dwmac-loongson:Fix missing return value
iwlwifi: fix printk format warnings in uefi.c
net: create netdev->dev_addr assignment helpers
bnxt_en: Fix possible unintended driver initiated error recovery
...
|
|
git://www.linux-watchdog.org/linux-watchdog
Pull watchdog updates from Wim Van Sebroeck:
- add Mediatek MT7986 & MT8195 wdt support
- add Maxim MAX63xx
- drop bd70528 support
- rewrite ixp4xx to watchdog framework
- constify static struct watchdog_ops for sl28cpld_wdt, mpc8xxx_wdt and
tqmx86
- introduce watchdog_dev_suspend/resume
- several fixes and improvements
* tag 'linux-watchdog-5.15-rc1' of git://www.linux-watchdog.org/linux-watchdog:
dt-bindings: watchdog: Add compatible for Mediatek MT7986
watchdog: ixp4xx: Rewrite driver to use core
watchdog: Start watchdog in watchdog_set_last_hw_keepalive only if appropriate
watchdog: max63xx_wdt: Add device tree probing
dt-bindings: watchdog: Add Maxim MAX63xx bindings
watchdog: mediatek: mt8195: add wdt support
dt-bindings: reset: mt8195: add toprgu reset-controller header file
watchdog: tqmx86: Constify static struct watchdog_ops
watchdog: mpc8xxx_wdt: Constify static struct watchdog_ops
watchdog: sl28cpld_wdt: Constify static struct watchdog_ops
watchdog: iTCO_wdt: Fix detection of SMI-off case
watchdog: bcm2835_wdt: consider system-power-controller property
watchdog: imx2_wdg: notify wdog core to stop ping worker on suspend
watchdog: introduce watchdog_dev_suspend/resume
watchdog: Fix NULL pointer dereference when releasing cdev
watchdog: only run driver set_pretimeout op if device supports it
watchdog: bd70528 drop bd70528 support
|
|
Pull KVM updates from Paolo Bonzini:
"ARM:
- Page ownership tracking between host EL1 and EL2
- Rely on userspace page tables to create large stage-2 mappings
- Fix incompatibility between pKVM and kmemleak
- Fix the PMU reset state, and improve the performance of the virtual
PMU
- Move over to the generic KVM entry code
- Address PSCI reset issues w.r.t. save/restore
- Preliminary rework for the upcoming pKVM fixed feature
- A bunch of MM cleanups
- a vGIC fix for timer spurious interrupts
- Various cleanups
s390:
- enable interpretation of specification exceptions
- fix a vcpu_idx vs vcpu_id mixup
x86:
- fast (lockless) page fault support for the new MMU
- new MMU now the default
- increased maximum allowed VCPU count
- allow inhibit IRQs on KVM_RUN while debugging guests
- let Hyper-V-enabled guests run with virtualized LAPIC as long as
they do not enable the Hyper-V "AutoEOI" feature
- fixes and optimizations for the toggling of AMD AVIC (virtualized
LAPIC)
- tuning for the case when two-dimensional paging (EPT/NPT) is
disabled
- bugfixes and cleanups, especially with respect to vCPU reset and
choosing a paging mode based on CR0/CR4/EFER
- support for 5-level page table on AMD processors
Generic:
- MMU notifier invalidation callbacks do not take mmu_lock unless
necessary
- improved caching of LRU kvm_memory_slot
- support for histogram statistics
- add statistics for halt polling and remote TLB flush requests"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (210 commits)
KVM: Drop unused kvm_dirty_gfn_invalid()
KVM: x86: Update vCPU's hv_clock before back to guest when tsc_offset is adjusted
KVM: MMU: mark role_regs and role accessors as maybe unused
KVM: MIPS: Remove a "set but not used" variable
x86/kvm: Don't enable IRQ when IRQ enabled in kvm_wait
KVM: stats: Add VM stat for remote tlb flush requests
KVM: Remove unnecessary export of kvm_{inc,dec}_notifier_count()
KVM: x86/mmu: Move lpage_disallowed_link further "down" in kvm_mmu_page
KVM: x86/mmu: Relocate kvm_mmu_page.tdp_mmu_page for better cache locality
Revert "KVM: x86: mmu: Add guest physical address check in translate_gpa()"
KVM: x86/mmu: Remove unused field mmio_cached in struct kvm_mmu_page
kvm: x86: Increase KVM_SOFT_MAX_VCPUS to 710
kvm: x86: Increase MAX_VCPUS to 1024
kvm: x86: Set KVM_MAX_VCPU_ID to 4*KVM_MAX_VCPUS
KVM: VMX: avoid running vmx_handle_exit_irqoff in case of emulation
KVM: x86/mmu: Don't freak out if pml5_root is NULL on 4-level host
KVM: s390: index kvm->arch.idle_mask by vcpu_idx
KVM: s390: Enable specification exception interpretation
KVM: arm64: Trim guest debug exception handling
KVM: SVM: Add 5-level page table support for SVM
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/andersson/remoteproc
Pull remoteproc updates from Bjorn Andersson:
- move the crash recovery worker to the freezable work queue to avoid
interaction with other drivers during suspend & resume
- fix a couple of typos in comments
- add support for handling the audio DSP on SDM660
- fix a race between the Qualcomm wireless subsystem driver and the
associated driver for the RF chip
* tag 'rproc-v5.15' of git://git.kernel.org/pub/scm/linux/kernel/git/andersson/remoteproc:
remoteproc: q6v5_pas: Add sdm660 ADSP PIL compatible
dt-bindings: remoteproc: qcom: adsp: Add SDM660 ADSP
remoteproc: use freezable workqueue for crash notifications
remoteproc: fix kernel doc for struct rproc_ops
remoteproc: fix an typo in fw_elf_get_class code comments
remoteproc: qcom: wcnss: Fix race with iris probe
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/lee/mfd
Pull MFD updates from Lee Jones:
"Core Frameworks:
- Add support for registering devices via MFD cells to Simple MFD (I2C)
New Drivers:
- Add support for Renesas Synchronization Management Unit (SMU)
New Device Support:
- Add support for N5010 to Intel M10 BMC
- Add support for Cannon Lake to Intel LPSS ACPI
- Add support for Samsung SSG{1,2} to ST-Ericsson's U8500 family
- Add support for TQMx110EB and TQMxE40x to TQ-Systems PLD TQMx86
New Functionality:
- Add support for GPIO to Intel LPC ICH
- Add support for Reset to Texas Instruments TPS65086
Fix-ups:
- Trivial, sorting, whitespace, renaming, etc; mt6360-core, db8500-prcmu-regs, tqmx86
- Device Tree fiddling; syscon, axp20x, qcom,pm8008, ti,tps65086, brcm,cru
- Use proper APIs for IRQ map resolution; ab8500-core, stmpe, tc3589x, wm8994-irq
- Pass 'supplied-from' property through axp288_fuel_gauge via swnode
- Remove unused file entry; MAINTAINERS
- Make interrupt line optional; tps65086
- Rename db8500-cpuidle driver symbol; db8500-prcmu
- Remove support for unused hardware; tqmx86
- Provide a standard LPC clock frequency for unknown boards; tqmx86
- Remove unused code; ti_am335x_tscadc
- Use of_iomap() instead of ioremap(); syscon
Bug Fixes:
- Clear GPIO IRQ resource flags when no IRQ is set; tqmx86
- Fix incorrect/misleading frequencies; db8500-prcmu
- Mitigate namespace clash with other GPIOBASE users"
* tag 'mfd-next-5.15' of git://git.kernel.org/pub/scm/linux/kernel/git/lee/mfd: (31 commits)
mfd: lpc_sch: Rename GPIOBASE to prevent build error
mfd: syscon: Use of_iomap() instead of ioremap()
dt-bindings: mfd: Add Broadcom CRU
mfd: ti_am335x_tscadc: Delete superfluous error message
mfd: tqmx86: Assume 24MHz LPC clock for unknown boards
mfd: tqmx86: Add support for TQ-Systems DMI IDs
mfd: tqmx86: Add support for TQMx110EB and TQMxE40x
mfd: tqmx86: Fix typo in "platform"
mfd: tqmx86: Remove incorrect TQMx90UC board ID
mfd: tqmx86: Clear GPIO IRQ resource when no IRQ is set
mfd: simple-mfd-i2c: Add support for registering devices via MFD cells
mfd/cpuidle: ux500: Rename driver symbol
mfd: tps65086: Add cell entry for reset driver
mfd: tps65086: Make interrupt line optional
dt-bindings: mfd: Convert tps65086.txt to YAML
MAINTAINERS: Adjust ARM/NOMADIK/Ux500 ARCHITECTURES to file renaming
mfd: db8500-prcmu: Handle missing FW variant
mfd: db8500-prcmu: Rename register header
mfd: axp20x: Add supplied-from property to axp288_fuel_gauge cell
mfd: Don't use irq_create_mapping() to resolve a mapping
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/brgl/linux
Pull gpio updates from Bartosz Golaszewski:
"We mostly have various improvements and refactoring all over the place
but also some interesting new features - like the virtio GPIO driver
that allows guest VMs to use host's GPIOs. We also have a new/old GPIO
driver for rockchip - this one has been split out of the pinctrl
driver.
Summary:
- new driver: gpio-virtio allowing a guest VM running linux to access
GPIO lines provided by the host
- split the GPIO driver out of the rockchip pin control driver
- add support for a new model to gpio-aspeed-sgpio, refactor the
driver and use generic device property interfaces, improve property
sanitization
- add ACPI support to gpio-tegra186
- improve the code setting the line names to support multiple GPIO
banks per device
- constify a bunch of OF functions in the core GPIO code and make the
declaration for one of the core OF functions we use consistent
within its header
- use software nodes in intel_quark_i2c_gpio
- add support for the gpio-line-names property in gpio-mt7621
- use the standard GPIO function for setting the GPIO names in
gpio-brcmstb
- fix a bunch of leaks and other bugs in gpio-mpc8xxx
- use generic pm callbacks in gpio-ml-ioh
- improve resource management and PM handling in gpio-mlxbf2
- modernize and improve the gpio-dwapb driver
- coding style improvements in gpio-rcar
- documentation fixes and improvements
- update the MAINTAINERS entry for gpio-zynq
- minor tweaks in several drivers"
* tag 'gpio-updates-for-v5.15' of git://git.kernel.org/pub/scm/linux/kernel/git/brgl/linux: (35 commits)
gpio: mpc8xxx: Use 'devm_gpiochip_add_data()' to simplify the code and avoid a leak
gpio: mpc8xxx: Fix a potential double iounmap call in 'mpc8xxx_probe()'
gpio: mpc8xxx: Fix a resources leak in the error handling path of 'mpc8xxx_probe()'
gpio: viperboard: remove platform_set_drvdata() call in probe
gpio: virtio: Add missing mailings lists in MAINTAINERS entry
gpio: virtio: Fix sparse warnings
gpio: remove the obsolete MX35 3DS BOARD MC9S08DZ60 GPIO functions
gpio: max730x: Use the right include
gpio: Add virtio-gpio driver
gpio: mlxbf2: Use DEFINE_RES_MEM_NAMED() helper macro
gpio: mlxbf2: Use devm_platform_ioremap_resource()
gpio: mlxbf2: Drop wrong use of ACPI_PTR()
gpio: mlxbf2: Convert to device PM ops
gpio: dwapb: Get rid of legacy platform data
mfd: intel_quark_i2c_gpio: Convert GPIO to use software nodes
gpio: dwapb: Read GPIO base from gpio-base property
gpio: dwapb: Unify ACPI enumeration checks in get_irq() and configure_irqs()
gpiolib: Deduplicate forward declaration in the consumer.h header
MAINTAINERS: update gpio-zynq.yaml reference
gpio: tegra186: Add ACPI support
...
|
|
Fix the kernel-doc comments for the improved Energy Model documentation.
Signed-off-by: Lukasz Luba <[email protected]>
Signed-off-by: Rafael J. Wysocki <[email protected]>
|
|
On some systems the nominal_perf value retrieved via CPPC is just
a constant and fetching it doesn't require accessing any registers,
so if it is the only CPPC capability that's needed, it is wasteful
to run cppc_get_perf_caps() in order to get just that value alone,
especially when this is done for CPUs other than the one running
the code.
For this reason, introduce cppc_get_nominal_perf() allowing
nominal_perf to be obtained individually, by generalizing the
existing cppc_get_desired_perf() (and renaming it) so it can be
used to retrieve any specific CPPC capability value.
While at it, clean up the cppc_get_desired_perf() kerneldoc comment
a bit.
Signed-off-by: Rafael J. Wysocki <[email protected]>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/danielt/linux
Pull kgdb updates from Daniel Thompson:
"Changes for kgdb/kdb this cycle are dominated by a change from Sumit
that removes as small (256K) private heap from kdb. This is change
I've hoped for ever since I discovered how few users of this heap
remained in the kernel, so many thanks to Sumit for hunting these
down.
The other change is an incremental step towards SPDX headers"
* tag 'kgdb-5.15-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/danielt/linux:
kernel: debug: Convert to SPDX identifier
kdb: Rename members of struct kdbtab_t
kdb: Simplify kdb_defcmd macro logic
kdb: Get rid of redundant kdb_register_flags()
kdb: Rename struct defcmd_set to struct kdb_macro
kdb: Get rid of custom debug heap allocator
|
|
Fix unused-const-variable warnings emitted by gcc when cxlmem.h is used
by pretty much all files except pci.c
Signed-off-by: Ben Widawsky <[email protected]>
Reviewed-by: Jonathan Cameron <[email protected]>
Link: https://lore.kernel.org/r/163072205652.2250120.16833548560832424468.stgit@dwillia2-desk3.amr.corp.intel.com
Signed-off-by: Dan Williams <[email protected]>
|
|
This reverts commit 9857a17f206ff374aea78bccfb687f145368be2e.
That commit was completely broken, and I should have caught on to it
earlier. But happily, the kernel test robot noticed the breakage fairly
quickly.
The breakage is because "try_get_page()" is about avoiding the page
reference count overflow case, but is otherwise the exact same as a
plain "get_page()".
In contrast, "try_get_compound_head()" is an entirely different beast,
and uses __page_cache_add_speculative() because it's not just about the
page reference count, but also about possibly racing with the underlying
page going away.
So all the commentary about how
"try_get_page() has fallen a little behind in terms of maintenance,
try_get_compound_head() handles speculative page references more
thoroughly"
was just completely wrong: yes, try_get_compound_head() handles
speculative page references, but the point is that try_get_page() does
not, and must not.
So there's no lack of maintainance - there are fundamentally different
semantics.
A speculative page reference would be entirely wrong in "get_page()",
and it's entirely wrong in "try_get_page()". It's not about
speculation, it's purely about "uhhuh, you can't get this page because
you've tried to increment the reference count too much already".
The reason the kernel test robot noticed this bug was that it hit the
VM_BUG_ON() in __page_cache_add_speculative(), which is all about
verifying that the context of any speculative page access is correct.
But since that isn't what try_get_page() is all about, the VM_BUG_ON()
tests things that are not correct to test for try_get_page().
Reported-by: kernel test robot <[email protected]>
Cc: John Hubbard <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Cc: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/vireshk/pm
Pull more ARM cpufreq changes for v5.15-rc1 from Viresh Kumar:
"This adds a new cpufreq driver for Mediatek, which had been going
through reviews since last one year."
* 'cpufreq/arm/linux-next' of git://git.kernel.org/pub/scm/linux/kernel/git/vireshk/pm:
cpufreq: mediatek-hw: Add support for CPUFREQ HW
cpufreq: Add of_perf_domain_get_sharing_cpumask
dt-bindings: cpufreq: add bindings for MediaTek cpufreq HW
|
|
Joakim Zhang reports that Wake-on-Lan with the stmmac ethernet driver broke
when moving the incorrect handling of mac link state out of mac_config().
This reason this breaks is because the stmmac's WoL is handled by the MAC
rather than the PHY, and phylink doesn't cater for that scenario.
This patch adds the necessary phylink code to handle suspend/resume events
according to whether the MAC still needs a valid link or not. This is the
barest minimum for this support.
Reported-by: Joakim Zhang <[email protected]>
Tested-by: Joakim Zhang <[email protected]>
Signed-off-by: Russell King (Oracle) <[email protected]>
Signed-off-by: Joakim Zhang <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
|
|
Pull libata fixes from Jens Axboe:
"Fixes for queued trim on certain Samsung SSDs, in conjunction with
certain ATI controllers"
* tag 'libata-5.15-2021-09-05' of git://git.kernel.dk/linux-block:
libata: Add ATA_HORKAGE_NO_NCQ_ON_ATI for Samsung 860 and 870 SSD.
libata: add ATA_HORKAGE_NO_NCQ_TRIM for Samsung 860 and 870 SSDs
|
|
Pull io_uring fixes from Jens Axboe:
"As sometimes happens, two reports came in around the merge window open
that led to some fixes. Hence this one is a bit bigger than usual
followup fixes, but most of it will be going towards stable, outside
of the fixes that are addressing regressions from this merge window.
In detail:
- postgres is a heavy user of signals between tasks, and if we're
unlucky this can interfere with io-wq worker creation. Make sure
we're resilient against unrelated signal handling. This set of
changes also includes hardening against allocation failures, which
could previously had led to stalls.
- Some use cases that end up having a mix of bounded and unbounded
work would have starvation issues related to that. Split the
pending work lists to handle that better.
- Completion trace int -> unsigned -> long fix
- Fix issue with REGISTER_IOWQ_MAX_WORKERS and SQPOLL
- Fix regression with hash wait lock in this merge window
- Fix retry issued on block devices (Ming)
- Fix regression with links in this merge window (Pavel)
- Fix race with multi-shot poll and completions (Xiaoguang)
- Ensure regular file IO doesn't inadvertently skip completion
batching (Pavel)
- Ensure submissions are flushed after running task_work (Pavel)"
* tag 'for-5.15/io_uring-2021-09-04' of git://git.kernel.dk/linux-block:
io_uring: io_uring_complete() trace should take an integer
io_uring: fix possible poll event lost in multi shot mode
io_uring: prolong tctx_task_work() with flushing
io_uring: don't disable kiocb_done() CQE batching
io_uring: ensure IORING_REGISTER_IOWQ_MAX_WORKERS works with SQPOLL
io-wq: make worker creation resilient against signals
io-wq: get rid of FIXED worker flag
io-wq: only exit on fatal signals
io-wq: split bounded and unbounded work into separate lists
io-wq: fix queue stalling race
io_uring: don't submit half-prepared drain request
io_uring: fix queueing half-created requests
io-wq: ensure that hash wait lock is IRQ disabling
io_uring: retry in case of short read on block device
io_uring: IORING_OP_WRITE needs hash_reg_file set
io-wq: fix race between adding work and activating a free worker
|
|
This VDUSE driver enables implementing software-emulated vDPA
devices in userspace. The vDPA device is created by
ioctl(VDUSE_CREATE_DEV) on /dev/vduse/control. Then a char device
interface (/dev/vduse/$NAME) is exported to userspace for device
emulation.
In order to make the device emulation more secure, the device's
control path is handled in kernel. A message mechnism is introduced
to forward some dataplane related control messages to userspace.
And in the data path, the DMA buffer will be mapped into userspace
address space through different ways depending on the vDPA bus to
which the vDPA device is attached. In virtio-vdpa case, the MMU-based
software IOTLB is used to achieve that. And in vhost-vdpa case, the
DMA buffer is reside in a userspace memory region which can be shared
to the VDUSE userspace processs via transferring the shmfd.
For more details on VDUSE design and usage, please see the follow-on
Documentation commit.
NB(mst): when merging this with
b542e383d8c0 ("eventfd: Make signal recursion protection a task bit")
replace eventfd_signal_count with eventfd_signal_allowed,
and drop the previous
("eventfd: Export eventfd_wake_count to modules").
Signed-off-by: Xie Yongji <[email protected]>
Acked-by: Jason Wang <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Michael S. Tsirkin <[email protected]>
|
|
This patch introduces an attribute for vDPA device to indicate
whether virtual address can be used. If vDPA device driver set
it, vhost-vdpa bus driver will not pin user page and transfer
userspace virtual address instead of physical address during
DMA mapping. And corresponding vma->vm_file and offset will be
also passed as an opaque pointer.
Suggested-by: Jason Wang <[email protected]>
Signed-off-by: Xie Yongji <[email protected]>
Acked-by: Jason Wang <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Michael S. Tsirkin <[email protected]>
|
|
Add an opaque pointer for DMA mapping.
Suggested-by: Jason Wang <[email protected]>
Signed-off-by: Xie Yongji <[email protected]>
Acked-by: Jason Wang <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Michael S. Tsirkin <[email protected]>
|
|
Add an opaque pointer for vhost IOTLB. And introduce
vhost_iotlb_add_range_ctx() to accept it.
Suggested-by: Jason Wang <[email protected]>
Signed-off-by: Xie Yongji <[email protected]>
Acked-by: Jason Wang <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Michael S. Tsirkin <[email protected]>
|
|
This adds a new callback to support device specific reset
behavior. The vdpa bus driver will call the reset function
instead of setting status to zero during resetting.
Signed-off-by: Xie Yongji <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Michael S. Tsirkin <[email protected]>
|
|
Fix some code indent issues and following checkpatch warning:
WARNING: Prefer 'unsigned int' to bare use of 'unsigned'
371: FILE: include/linux/vdpa.h:371:
+static inline void vdpa_get_config(struct vdpa_device *vdev, unsigned offset,
Signed-off-by: Xie Yongji <[email protected]>
Acked-by: Jason Wang <[email protected]>
Reviewed-by: Stefano Garzarella <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Michael S. Tsirkin <[email protected]>
|
|
Export receive_fd() so that some modules can use
it to pass file descriptor between processes without
missing any security stuffs.
Signed-off-by: Xie Yongji <[email protected]>
Acked-by: Jason Wang <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Michael S. Tsirkin <[email protected]>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into HEAD
KVM/arm64 updates for 5.15
- Page ownership tracking between host EL1 and EL2
- Rely on userspace page tables to create large stage-2 mappings
- Fix incompatibility between pKVM and kmemleak
- Fix the PMU reset state, and improve the performance of the virtual PMU
- Move over to the generic KVM entry code
- Address PSCI reset issues w.r.t. save/restore
- Preliminary rework for the upcoming pKVM fixed feature
- A bunch of MM cleanups
- a vGIC fix for timer spurious interrupts
- Various cleanups
|
|
Add a new stat that counts the number of times a remote TLB flush is
requested, regardless of whether it kicks vCPUs out of guest mode. This
allows us to look at how often flushes are initiated.
Unlike remote_tlb_flush, this one applies to ARM's instruction-set-based
TLB flush implementation, so apply it there too.
Original-by: David Matlack <[email protected]>
Signed-off-by: Jing Zhang <[email protected]>
Message-Id: <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
|
|
Add of_perf_domain_get_sharing_cpumask function to group cpu
to specific performance domain.
Signed-off-by: Hector.Yuan <[email protected]>
[ Viresh: create separate routine parse_perf_domain() and always set the
cpumask. ]
Signed-off-by: Viresh Kumar <[email protected]>
|
|
This bit is used to handle POSIX MSG_EOR flag passed from
userspace in 'send*()' system calls. It marks end of each
record and is visible to receiver using 'recvmsg()' system
call.
Signed-off-by: Arseny Krasnov <[email protected]>
Reviewed-by: Stefano Garzarella <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Michael S. Tsirkin <[email protected]>
|
|
This current implemented bit is used to mark end of messages
('EOM' - end of message), not records('EOR' - end of record).
Also rename 'record' to 'message' in implementation as it is
different things.
Signed-off-by: Arseny Krasnov <[email protected]>
Reviewed-by: Stefano Garzarella <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Michael S. Tsirkin <[email protected]>
|
|
This synchronizes the virtio ids with the latest list from virtio
specification.
Signed-off-by: Viresh Kumar <[email protected]>
Link: https://lore.kernel.org/r/61b27e3bc61fb0c9f067001e95cfafc5d37d414a.1627362340.git.viresh.kumar@linaro.org
Signed-off-by: Michael S. Tsirkin <[email protected]>
|
|
Recent work on converting address list to a tree made it obvious
we need an abstraction around writing netdev->dev_addr. Without
such abstraction updating the main device address is invisible
to the core.
Introduce a number of helpers which for now just wrap memcpy()
but in the future can make necessary changes to the address
tree.
Signed-off-by: Jakub Kicinski <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull tracing updates from Steven Rostedt:
- simplify the Kconfig use of FTRACE and TRACE_IRQFLAGS_SUPPORT
- bootconfig can now start histograms
- bootconfig supports group/all enabling
- histograms now can put values in linear size buckets
- execnames can be passed to synthetic events
- introduce "event probes" that attach to other events and can retrieve
data from pointers of fields, or record fields as different types (a
pointer to a string as a string instead of just a hex number)
- various fixes and clean ups
* tag 'trace-v5.15' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace: (35 commits)
tracing/doc: Fix table format in histogram code
selftests/ftrace: Add selftest for testing duplicate eprobes and kprobes
selftests/ftrace: Add selftest for testing eprobe events on synthetic events
selftests/ftrace: Add test case to test adding and removing of event probe
selftests/ftrace: Fix requirement check of README file
selftests/ftrace: Add clear_dynamic_events() to test cases
tracing: Add a probe that attaches to trace events
tracing/probes: Reject events which have the same name of existing one
tracing/probes: Have process_fetch_insn() take a void * instead of pt_regs
tracing/probe: Change traceprobe_set_print_fmt() to take a type
tracing/probes: Use struct_size() instead of defining custom macros
tracing/probes: Allow for dot delimiter as well as slash for system names
tracing/probe: Have traceprobe_parse_probe_arg() take a const arg
tracing: Have dynamic events have a ref counter
tracing: Add DYNAMIC flag for dynamic events
tracing: Replace deprecated CPU-hotplug functions.
MAINTAINERS: Add an entry for os noise/latency
tracepoint: Fix kerneldoc comments
bootconfig/tracing/ktest: Update ktest example for boot-time tracing
tools/bootconfig: Use per-group/all enable option in ftrace2bconf script
...
|
|
Pull MAP_DENYWRITE removal from David Hildenbrand:
"Remove all in-tree usage of MAP_DENYWRITE from the kernel and remove
VM_DENYWRITE.
There are some (minor) user-visible changes:
- We no longer deny write access to shared libaries loaded via legacy
uselib(); this behavior matches modern user space e.g. dlopen().
- We no longer deny write access to the elf interpreter after exec
completed, treating it just like shared libraries (which it often
is).
- We always deny write access to the file linked via /proc/pid/exe:
sys_prctl(PR_SET_MM_MAP/EXE_FILE) will fail if write access to the
file cannot be denied, and write access to the file will remain
denied until the link is effectivel gone (exec, termination,
sys_prctl(PR_SET_MM_MAP/EXE_FILE)) -- just as if exec'ing the file.
Cross-compiled for a bunch of architectures (alpha, microblaze, i386,
s390x, ...) and verified via ltp that especially the relevant tests
(i.e., creat07 and execve04) continue working as expected"
* tag 'denywrite-for-5.15' of git://github.com/davidhildenbrand/linux:
fs: update documentation of get_write_access() and friends
mm: ignore MAP_DENYWRITE in ksys_mmap_pgoff()
mm: remove VM_DENYWRITE
binfmt: remove in-tree usage of MAP_DENYWRITE
kernel/fork: always deny write access to current MM exe_file
kernel/fork: factor out replacing the current MM exe_file
binfmt: don't use MAP_DENYWRITE when loading shared libraries via uselib()
|