|
Patch series "mm, memcg: memory.{low,min} reclaim fix & cleanup", v4.
This series contains a fix for an edge case in my earlier protection
calculation patches, and a patch to make the area overall a little more
robust, to hopefully help avoid this in the future.
This patch (of 2):
A cgroup can have both memory protection and a memory limit to isolate it
from its siblings in both directions - for example, to prevent it from
being shrunk below 2G under high pressure from outside, but also from
growing beyond 4G under low pressure.
Commit 9783aa9917f8 ("mm, memcg: proportional memory.{low,min} reclaim")
implemented proportional scan pressure so that multiple siblings in excess
of their protection settings don't get reclaimed equally but instead in
accordance to their unprotected portion.
During limit reclaim, this proportionality shouldn't apply of course:
there is no competition, all pressure is from within the cgroup and should
be applied as such. Reclaim should operate at full efficiency.
However, mem_cgroup_protected() never expected anybody to look at the
effective protection values when it indicated that the cgroup is above its
protection. As a result, a query during limit reclaim may return stale
protection values that were calculated by a previous reclaim cycle in
which the cgroup did have siblings.
When this happens, reclaim is unnecessarily hesitant and potentially slow
to meet the desired limit. In theory this could lead to premature OOM
kills, although it's not obvious this has occurred in practice.
Work around the problem by special-casing reclaim roots in
mem_cgroup_protection(). These memcgs never participate in reclaim
protection because the reclaim is internal.
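Concretely, the special case boils down to something like this (a
simplified sketch of mem_cgroup_protection(); the exact signature and the
emin/elow fields follow include/linux/memcontrol.h at the time of this
series):
  static inline unsigned long mem_cgroup_protection(struct mem_cgroup *root,
                                                    struct mem_cgroup *memcg,
                                                    bool in_low_reclaim)
  {
          if (mem_cgroup_disabled())
                  return 0;

          /* The reclaim root is never protected against its own reclaim. */
          if (root == memcg)
                  return 0;

          if (in_low_reclaim)
                  return READ_ONCE(memcg->memory.emin);

          return max(READ_ONCE(memcg->memory.emin),
                     READ_ONCE(memcg->memory.elow));
  }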
We have to ignore effective protection values for reclaim roots because
mem_cgroup_protected might be called from racing reclaim contexts with
different roots. The calculation relies on a root -> leaf tree traversal,
so the top-down reclaim protection invariants should hold. The only
exception is the reclaim root, which should have its effective protection
set to 0, but that would be problematic for the following setup:
Let's have global and A's reclaim in parallel:
 |
 A (low=2G, usage = 3G, max = 3G, children_low_usage = 1.5G)
 |\
 | C (low = 1G, usage = 2.5G)
 |
 B (low = 1G, usage = 0.5G)
For A's reclaim we have
B.elow = B.low
C.elow = C.low
For the global reclaim
A.elow = A.low
B.elow = min(B.usage, B.low) because children_low_usage <= A.elow
C.elow = min(C.usage, C.low)
With the effective values reset, for A's reclaim we would have
A.elow = 0
B.elow = B.low
C.elow = C.low
and the global reclaim could see the above and then compute
B.elow = C.elow = 0 because children_low_usage > A.elow,
which means that protected memcgs would get reclaimed.
In the future we would like to make mem_cgroup_protected more robust
against racing reclaim contexts, but that is likely a more complex
solution than this simple workaround.
[[email protected] - large part of the changelog]
[[email protected] - workaround explanation]
[[email protected] - retitle]
Fixes: 9783aa9917f8 ("mm, memcg: proportional memory.{low,min} reclaim")
Signed-off-by: Yafang Shao <[email protected]>
Signed-off-by: Chris Down <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Acked-by: Michal Hocko <[email protected]>
Acked-by: Johannes Weiner <[email protected]>
Acked-by: Chris Down <[email protected]>
Acked-by: Roman Gushchin <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Link: http://lkml.kernel.org/r/044fb8ecffd001c7905d27c0c2ad998069fdc396.1594638158.git.chris@chrisdown.name
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Reclaim retries have been set to 5 since the beginning of time in
commit 66e1707bc346 ("Memory controller: add per cgroup LRU and
reclaim"). However, we now have a generally agreed-upon standard for
page reclaim: MAX_RECLAIM_RETRIES (currently 16), added many years later
in commit 0a0337e0d1d1 ("mm, oom: rework oom detection").
In the absence of a compelling reason to declare an OOM earlier in memcg
context than page allocator context, it seems reasonable to supplant
MEM_CGROUP_RECLAIM_RETRIES with MAX_RECLAIM_RETRIES, making the page
allocator and memcg internals more similar in semantics when reclaim
fails to produce results, avoiding premature OOMs or throttling.
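For reference, the change amounts to dropping the memcg-private constant
in favour of the shared one (values as described above; see
mm/memcontrol.c and mm/internal.h):
  /* Before: memcg declared its own retry count in mm/memcontrol.c. */
  #define MEM_CGROUP_RECLAIM_RETRIES      5

  /* After: reuse the page allocator's definition from mm/internal.h. */
  #define MAX_RECLAIM_RETRIES             16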
Signed-off-by: Chris Down <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Reviewed-by: Shakeel Butt <[email protected]>
Acked-by: Michal Hocko <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Tejun Heo <[email protected]>
Cc: Roman Gushchin <[email protected]>
Link: http://lkml.kernel.org/r/da557856c9c7654308eaff4eedc1952a95e8df5f.1594640214.git.chris@chrisdown.name
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Patch series "mm, memcg: reclaim harder before high throttling", v2.
This patch (of 2):
In Facebook production, we've seen cases where cgroups have been put into
allocator throttling even when they appear to have a lot of slack file
caches which should be trivially reclaimable.
Looking more closely, the problem is that we only try a single cgroup
reclaim walk for each return to usermode before calculating whether or not
we should throttle. This single attempt doesn't produce enough pressure
to shrink for cgroups with a rapidly growing amount of file caches prior
to entering allocator throttling.
As an example, we see that threads in an affected cgroup are stuck in
allocator throttling:
# for i in $(cat cgroup.threads); do
> grep over_high "/proc/$i/stack"
> done
[<0>] mem_cgroup_handle_over_high+0x10b/0x150
[<0>] mem_cgroup_handle_over_high+0x10b/0x150
[<0>] mem_cgroup_handle_over_high+0x10b/0x150
...however, there is no I/O pressure reported by PSI, despite a lot of
slack file pages:
# cat memory.pressure
some avg10=78.50 avg60=84.99 avg300=84.53 total=5702440903
full avg10=78.50 avg60=84.99 avg300=84.53 total=5702116959
# cat io.pressure
some avg10=0.00 avg60=0.00 avg300=0.00 total=78051391
full avg10=0.00 avg60=0.00 avg300=0.00 total=78049640
# grep _file memory.stat
inactive_file 1370939392
active_file 661635072
This patch changes the behaviour to retry reclaim either until the current
task goes below the 10ms grace period, or we are making no reclaim
progress at all. In the latter case, we enter reclaim throttling as
before.
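Schematically, the new behaviour looks roughly like this (a simplified
sketch of the loop in mem_cgroup_handle_over_high(); reclaim_high() is the
existing helper, while mem_over_high_penalty() is a stand-in name for the
real penalty calculation):
  unsigned long nr_reclaimed, penalty_jiffies;
  int nr_retries = MAX_RECLAIM_RETRIES;

  do {
          /* One full cgroup reclaim walk targeted at the excess. */
          nr_reclaimed = reclaim_high(memcg, nr_pages, GFP_KERNEL);

          /* Stand-in for the real penalty calculation. */
          penalty_jiffies = mem_over_high_penalty(memcg, nr_pages);
          if (penalty_jiffies <= HZ / 100)
                  return;         /* back under the ~10ms grace period */
  } while (nr_reclaimed || nr_retries--);

  /* No reclaim progress at all: throttle as before. */
  schedule_timeout_killable(penalty_jiffies);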
To a user, there's no intuitive reason for the reclaim behaviour to differ
between hitting memory.high as part of a new allocation and hitting
memory.high because someone lowered its value. As such, this also brings
an added benefit: it unifies the reclaim behaviour between the two cases.
There's precedent for this behaviour: we already do reclaim retries when
writing to memory.{high,max}, in max reclaim, and in the page allocator
itself.
Signed-off-by: Chris Down <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Reviewed-by: Shakeel Butt <[email protected]>
Acked-by: Johannes Weiner <[email protected]>
Cc: Tejun Heo <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Roman Gushchin <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Link: http://lkml.kernel.org/r/a4e23b59e9ef499b575ae73a8120ee089b7d3373.1594640214.git.chris@chrisdown.name
Signed-off-by: Linus Torvalds <[email protected]>
|
|
The memory.high limit is implemented in such a way that the kernel
penalizes all threads which are allocating memory over the limit. Forcing
all threads into synchronous reclaim and adding some artificial delays
makes it possible to slow down the memory consumption and potentially give
userspace oom handlers/resource control agents some time to react.
It works nicely if the memory usage is hitting the limit from below,
however it works sub-optimally if a user adjusts memory.high to a value
way below the current memory usage. It basically forces all workload
threads (doing any memory allocations) into synchronous reclaim and sleep.
This makes the workload completely unresponsive for a long period of time
and can also lead to a system-wide contention on lru locks. It can happen
even if the workload is not actually tight on memory and has, for example,
a ton of cold pagecache.
In the current implementation writing to memory.high causes an atomic
update of page counter's high value followed by an attempt to reclaim
enough memory to fit into the new limit. To fix the problem described
above, all we need is to change the order of execution: try to push the
memory usage under the limit first, and only then set the new high limit.
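In code, the reordering amounts to something like this (a simplified
sketch of memory_high_write() after the change; error handling and the
stock-draining step are omitted):
  unsigned long high, nr_pages;
  unsigned int nr_retries = MAX_RECLAIM_RETRIES;    /* retry budget */
  int err;

  err = page_counter_memparse(buf, "max", &high);
  if (err)
          return err;

  /* 1) Try to push the usage under the new target first ... */
  for (;;) {
          nr_pages = page_counter_read(&memcg->memory);
          if (nr_pages <= high || signal_pending(current))
                  break;
          if (!try_to_free_mem_cgroup_pages(memcg, nr_pages - high,
                                            GFP_KERNEL, true) &&
              !nr_retries--)
                  break;
  }

  /* 2) ... and only then publish the new high limit. */
  page_counter_set_high(&memcg->memory, high);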
Reported-by: Domas Mituzas <[email protected]>
Signed-off-by: Roman Gushchin <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Reviewed-by: Shakeel Butt <[email protected]>
Acked-by: Michal Hocko <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Tejun Heo <[email protected]>
Cc: Chris Down <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Currently memcg_kmem_enabled() is optimized for the kernel memory
accounting being off. It was so for a long time, and arguably the reason
was that the kernel memory accounting was initially an opt-in feature.
However, now it's on by default on both cgroup v1 and cgroup v2, and it's
on for all cgroups. So let's switch over to static_branch_likely() to
reflect this fact.
It's unlikely there is a significant performance difference, as the cost
of a memory allocation and its accounting significantly exceeds the cost
of a jump. However, the conversion makes the code look more logical.
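The resulting helper is tiny (shown for illustration; memcg_kmem_enabled_key
is the existing static key defined in mm/memcontrol.c):
  static inline bool memcg_kmem_enabled(void)
  {
          return static_branch_likely(&memcg_kmem_enabled_key);
  }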
Signed-off-by: Roman Gushchin <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Reviewed-by: Shakeel Butt <[email protected]>
Acked-by: Vlastimil Babka <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Christoph Lameter <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: Joonsoo Kim <[email protected]>
Cc: Pekka Enberg <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Linus Torvalds <[email protected]>
|
|
charge_slab_page() and uncharge_slab_page() are not related anymore to
memcg charging and uncharging. In order to make their names less
confusing, let's rename them to account_slab_page() and
unaccount_slab_page() respectively.
Signed-off-by: Roman Gushchin <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Reviewed-by: Shakeel Butt <[email protected]>
Acked-by: Vlastimil Babka <[email protected]>
Cc: Christoph Lameter <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Joonsoo Kim <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Pekka Enberg <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Linus Torvalds <[email protected]>
|
|
charge_slab_page() is not using the gfp argument anymore,
remove it.
Signed-off-by: Roman Gushchin <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Reviewed-by: Shakeel Butt <[email protected]>
Acked-by: Vlastimil Babka <[email protected]>
Cc: Christoph Lameter <[email protected]>
Cc: Pekka Enberg <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: Joonsoo Kim <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Michal Hocko <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Currently the kernel stack is being accounted per-zone. There is no need
to do that. In addition, due to being per-zone, memcg has to keep a
separate MEMCG_KERNEL_STACK_KB. Make the stat per-node and deprecate
MEMCG_KERNEL_STACK_KB, as memcg_stat_item is an extension of
node_stat_item. Also, localize the kernel stack stats updates to
account_kernel_stack().
Signed-off-by: Shakeel Butt <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Reviewed-by: Roman Gushchin <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Michal Hocko <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Add a drgn-based tool to display slab information for a given memcg. It
can replace the cgroup v1 memory.kmem.slabinfo interface on cgroup v2, but
in a more flexible way.
Currently it supports only the SLUB configuration, but SLAB support can be
trivially added later.
Output example:
$ sudo ./tools/cgroup/memcg_slabinfo.py /sys/fs/cgroup/user.slice/user-111017.slice/user\@111017.service
shmem_inode_cache 92 92 704 46 8 : tunables 0 0 0 : slabdata 2 2 0
eventpoll_pwq 56 56 72 56 1 : tunables 0 0 0 : slabdata 1 1 0
eventpoll_epi 32 32 128 32 1 : tunables 0 0 0 : slabdata 1 1 0
kmalloc-8 0 0 8 512 1 : tunables 0 0 0 : slabdata 0 0 0
kmalloc-96 0 0 96 42 1 : tunables 0 0 0 : slabdata 0 0 0
kmalloc-2048 0 0 2048 16 8 : tunables 0 0 0 : slabdata 0 0 0
kmalloc-64 128 128 64 64 1 : tunables 0 0 0 : slabdata 2 2 0
mm_struct 160 160 1024 32 8 : tunables 0 0 0 : slabdata 5 5 0
signal_cache 96 96 1024 32 8 : tunables 0 0 0 : slabdata 3 3 0
sighand_cache 45 45 2112 15 8 : tunables 0 0 0 : slabdata 3 3 0
files_cache 138 138 704 46 8 : tunables 0 0 0 : slabdata 3 3 0
task_delay_info 153 153 80 51 1 : tunables 0 0 0 : slabdata 3 3 0
task_struct 27 27 3520 9 8 : tunables 0 0 0 : slabdata 3 3 0
radix_tree_node 56 56 584 28 4 : tunables 0 0 0 : slabdata 2 2 0
btrfs_inode 140 140 1136 28 8 : tunables 0 0 0 : slabdata 5 5 0
kmalloc-1024 64 64 1024 32 8 : tunables 0 0 0 : slabdata 2 2 0
kmalloc-192 84 84 192 42 2 : tunables 0 0 0 : slabdata 2 2 0
inode_cache 54 54 600 27 4 : tunables 0 0 0 : slabdata 2 2 0
kmalloc-128 0 0 128 32 1 : tunables 0 0 0 : slabdata 0 0 0
kmalloc-512 32 32 512 32 4 : tunables 0 0 0 : slabdata 1 1 0
skbuff_head_cache 32 32 256 32 2 : tunables 0 0 0 : slabdata 1 1 0
sock_inode_cache 46 46 704 46 8 : tunables 0 0 0 : slabdata 1 1 0
cred_jar 378 378 192 42 2 : tunables 0 0 0 : slabdata 9 9 0
proc_inode_cache 96 96 672 24 4 : tunables 0 0 0 : slabdata 4 4 0
dentry 336 336 192 42 2 : tunables 0 0 0 : slabdata 8 8 0
filp 697 864 256 32 2 : tunables 0 0 0 : slabdata 27 27 0
anon_vma 644 644 88 46 1 : tunables 0 0 0 : slabdata 14 14 0
pid 1408 1408 64 64 1 : tunables 0 0 0 : slabdata 22 22 0
vm_area_struct 1200 1200 200 40 2 : tunables 0 0 0 : slabdata 30 30 0
Signed-off-by: Roman Gushchin <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Acked-by: Tejun Heo <[email protected]>
Cc: Christoph Lameter <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Shakeel Butt <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Add some tests to cover the kernel memory accounting functionality. These
cover some issues (and changes) we had recently.
1) A test which allocates a lot of negative dentries, checks memcg slab
statistics, creates memory pressure by setting memory.max to some low
value and checks that some number of slabs was reclaimed.
2) A test which covers side effects of memcg destruction: it creates
and destroys a large number of sub-cgroups, each containing a
multi-threaded workload which allocates and releases some kernel
memory. Then it checks that the charge and memory.stat values do add up
on the parent level.
3) A test which reads /proc/kpagecgroup and implicitly checks that it
doesn't crash the system.
4) A test which spawns a large number of threads and checks that the
kernel stacks accounting works as expected.
5) A test which checks that living charged slab objects are not
preventing the memory cgroup from being released after being deleted by
a user.
Signed-off-by: Roman Gushchin <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Cc: Christoph Lameter <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Shakeel Butt <[email protected]>
Cc: Tejun Heo <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Instead of having two sets of kmem_caches: one for system-wide and
non-accounted allocations and the second one shared by all accounted
allocations, we can use just one.
The idea is simple: space for obj_cgroup metadata can be allocated on
demand and filled only for accounted allocations.
It allows removing a bunch of code which is required to handle kmem_cache
clones for accounted allocations. There is no more need to create them,
accumulate statistics, propagate attributes, etc. It's quite a
significant simplification.
Also, because the total number of slab_caches is almost halved (not all
kmem_caches have a memcg clone), some additional memory savings are
expected. On my devvm it additionally saves about 3.5% of slab memory.
[[email protected]: fix build on MIPS]
Link: http://lkml.kernel.org/r/[email protected]
Suggested-by: Johannes Weiner <[email protected]>
Signed-off-by: Roman Gushchin <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Reviewed-by: Vlastimil Babka <[email protected]>
Reviewed-by: Shakeel Butt <[email protected]>
Cc: Christoph Lameter <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Tejun Heo <[email protected]>
Cc: Naresh Kamboju <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Linus Torvalds <[email protected]>
|
|
memcg_accumulate_slabinfo() is never called with a non-root kmem_cache as
a first argument, so the is_root_cache(s) check is redundant and can be
removed without any functional change.
Signed-off-by: Roman Gushchin <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Reviewed-by: Vlastimil Babka <[email protected]>
Reviewed-by: Shakeel Butt <[email protected]>
Cc: Christoph Lameter <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Tejun Heo <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Currently there are two lists of kmem_caches:
1) slab_caches, which contains all kmem_caches,
2) slab_root_caches, which contains only root kmem_caches.
And there is some preprocessor magic to have a single list if
CONFIG_MEMCG_KMEM isn't enabled.
It was required earlier because the number of non-root kmem_caches was
proportional to the number of memory cgroups and could reach really big
values. Now that it cannot exceed the number of root kmem_caches, there
is really no reason to maintain two lists.
We never iterate over the slab_root_caches list on any hot paths, so it's
perfectly fine to iterate over slab_caches and filter out non-root
kmem_caches.
It allows removing a lot of config-dependent code and two pointers from
the kmem_cache structure.
Signed-off-by: Roman Gushchin <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Reviewed-by: Vlastimil Babka <[email protected]>
Reviewed-by: Shakeel Butt <[email protected]>
Cc: Christoph Lameter <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Tejun Heo <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Linus Torvalds <[email protected]>
|
|
The memcg_kmem_get_cache() function became really trivial, so let's just
inline it into the single call point: memcg_slab_pre_alloc_hook().
It will make the code less bulky and can also help the compiler generate
better code.
Signed-off-by: Roman Gushchin <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Reviewed-by: Vlastimil Babka <[email protected]>
Reviewed-by: Shakeel Butt <[email protected]>
Cc: Christoph Lameter <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Tejun Heo <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Because the number of non-root kmem_caches doesn't depend on the number of
memory cgroups anymore and is generally not very big, there is no more
need for a dedicated workqueue.
Also, as there is no more need to pass any arguments to
memcg_create_kmem_cache() except the root kmem_cache, it's possible to
just embed the work structure into the kmem_cache and avoid the dynamic
allocation of the work structure.
This will also simplify the synchronization: for each root kmem_cache
there is only one work. So there will be no more concurrent attempts to
create a non-root kmem_cache for a root kmem_cache: the second and all
following attempts to queue the work will fail.
On the kmem_cache destruction path there is no more need to call the
expensive flush_workqueue() and wait for all pending works to be finished.
Instead, cancel_work_sync() can be used to cancel/wait for only one work.
Signed-off-by: Roman Gushchin <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Reviewed-by: Vlastimil Babka <[email protected]>
Reviewed-by: Shakeel Butt <[email protected]>
Cc: Christoph Lameter <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Tejun Heo <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Linus Torvalds <[email protected]>
|
|
This is a fairly big but mostly red patch, which makes all accounted slab
allocations use a single set of kmem_caches instead of creating a separate
set for each memory cgroup.
Because the number of non-root kmem_caches is now capped by the number of
root kmem_caches, there is no need to shrink or destroy them prematurely.
They can be perfectly destroyed together with their root counterparts.
This allows us to dramatically simplify the management of non-root
kmem_caches and delete a ton of code.
This patch performs the following changes:
1) introduces memcg_params.memcg_cache pointer to represent the
kmem_cache which will be used for all non-root allocations
2) reuses the existing memcg kmem_cache creation mechanism
to create memcg kmem_cache on the first allocation attempt
3) memcg kmem_caches are named <kmemcache_name>-memcg,
e.g. dentry-memcg
4) simplifies memcg_kmem_get_cache() to just return the memcg kmem_cache
or schedule its creation and return the root cache
5) removes almost all non-root kmem_cache management code
(separate refcounter, reparenting, shrinking, etc)
6) makes the slab debugfs display the root_mem_cgroup css id and never
show the :dead and :deact flags in the memcg_slabinfo attribute.
Following patches in the series will simplify the kmem_cache creation.
Signed-off-by: Roman Gushchin <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Reviewed-by: Vlastimil Babka <[email protected]>
Reviewed-by: Shakeel Butt <[email protected]>
Cc: Christoph Lameter <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Tejun Heo <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Linus Torvalds <[email protected]>
|
|
To make the memcg_kmem_bypass() function available outside of
memcontrol.c, let's move it to memcontrol.h. The function is small and
fits nicely as a static inline.
It will be used from the slab code.
Signed-off-by: Roman Gushchin <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Reviewed-by: Vlastimil Babka <[email protected]>
Reviewed-by: Shakeel Butt <[email protected]>
Cc: Christoph Lameter <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Tejun Heo <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Deprecate memory.kmem.slabinfo.
An empty file will be presented if the corresponding config options are
enabled.
The interface is implementation dependent, isn't present in cgroup v2, and
is generally useful only for core mm debugging purposes. In other words,
it doesn't provide any value for the absolute majority of users.
A drgn-based replacement can be found in
tools/cgroup/memcg_slabinfo.py. It supports cgroup v1 and v2, mimics
the memory.kmem.slabinfo output and also allows getting any additional
information without the need to recompile the kernel.
If a drgn-based solution is too slow for a task, a bpf-based tracing tool
can be used, which can easily keep track of all slab allocations belonging
to a memory cgroup.
Signed-off-by: Roman Gushchin <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Reviewed-by: Vlastimil Babka <[email protected]>
Reviewed-by: Shakeel Butt <[email protected]>
Acked-by: Johannes Weiner <[email protected]>
Cc: Christoph Lameter <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Tejun Heo <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Switch to per-object accounting of non-root slab objects.
Charging is performed using the obj_cgroup API in the pre_alloc hook.
Obj_cgroup is charged with the size of the object and the size of the
metadata, which for now is the size of an obj_cgroup pointer. If the amount of
memory has been charged successfully, the actual allocation code is
executed. Otherwise, -ENOMEM is returned.
In the post_alloc hook if the actual allocation succeeded, corresponding
vmstats are bumped and the obj_cgroup pointer is saved. Otherwise, the
charge is canceled.
On the free path obj_cgroup pointer is obtained and used to uncharge the
size of the releasing object.
Memcg and lruvec counters are now representing only memory used by active
slab objects and do not include the free space. The free space is shared
and doesn't belong to any specific cgroup.
Global per-node slab vmstats are still modified from
(un)charge_slab_page() functions. The idea is to keep all slab pages
accounted as slab pages on system level.
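Roughly, the pre_alloc hook now does the following (a simplified
per-object sketch; the real memcg_slab_pre_alloc_hook() in mm/slab.h
charges whole batches, and `s`, `flags` and `objcgp` come from its
arguments):
  struct obj_cgroup *objcg;

  objcg = get_obj_cgroup_from_current();
  if (!objcg)
          return true;    /* not accounted: proceed uncharged */

  /* object size plus one obj_cgroup pointer worth of metadata */
  if (obj_cgroup_charge(objcg, flags,
                        s->size + sizeof(struct obj_cgroup *))) {
          obj_cgroup_put(objcg);
          return false;   /* -ENOMEM: skip the actual allocation */
  }

  *objcgp = objcg;        /* consumed later by the post_alloc hook */
  return true;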
Signed-off-by: Roman Gushchin <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Reviewed-by: Vlastimil Babka <[email protected]>
Reviewed-by: Shakeel Butt <[email protected]>
Cc: Christoph Lameter <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Tejun Heo <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Store the obj_cgroup pointer in the corresponding place of
page->obj_cgroups for each allocated non-root slab object. Make sure that
each allocated object holds a reference to obj_cgroup.
The objcg pointer is obtained by dereferencing memcg->objcg in
memcg_kmem_get_cache() and is passed from the pre_alloc hook to the
post_alloc hook. Then, in case of successful allocation(s), it gets
stored in the page->obj_cgroups vector.
The objcg-obtaining part looks a bit bulky now, but it will be simplified
by the next commits in the series.
Signed-off-by: Roman Gushchin <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Reviewed-by: Vlastimil Babka <[email protected]>
Reviewed-by: Shakeel Butt <[email protected]>
Cc: Christoph Lameter <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Tejun Heo <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Allocate and release memory to store obj_cgroup pointers for each non-root
slab page. Reuse page->mem_cgroup pointer to store a pointer to the
allocated space.
This commit temporarily increases the memory footprint of the kernel
memory accounting: to store the obj_cgroup pointers, we'll need a place
for one pointer per allocated object. However, the following patches in
the series will enable sharing of slab pages between memory cgroups, which
will dramatically increase the total slab utilization, and the final
memory footprint will be significantly smaller than before.
To distinguish between obj_cgroups and memcg pointers in case when it's
not obvious which one is used (as in page_cgroup_ino()), let's always set
the lowest bit in the obj_cgroup case. The original obj_cgroups
pointer is marked to be ignored by kmemleak, which otherwise would
report a memory leak for each allocated vector.
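The tagging scheme boils down to a couple of helpers along these lines
(simplified from mm/slab.h in this series; the lowest bit distinguishes an
obj_cgroups vector from a plain memcg pointer sharing the same word):
  static inline bool page_has_obj_cgroups(struct page *page)
  {
          return (unsigned long)page->obj_cgroups & 0x1UL;
  }

  static inline struct obj_cgroup **page_obj_cgroups(struct page *page)
  {
          return (struct obj_cgroup **)
                  ((unsigned long)page->obj_cgroups & ~0x1UL);
  }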
Signed-off-by: Roman Gushchin <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Reviewed-by: Vlastimil Babka <[email protected]>
Reviewed-by: Shakeel Butt <[email protected]>
Cc: Christoph Lameter <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Tejun Heo <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Linus Torvalds <[email protected]>
|
|
The obj_cgroup API provides the ability to account sub-page-sized kernel
objects, which potentially outlive the original memory cgroup.
The top-level API consists of the following functions:
bool obj_cgroup_tryget(struct obj_cgroup *objcg);
void obj_cgroup_get(struct obj_cgroup *objcg);
void obj_cgroup_put(struct obj_cgroup *objcg);
int obj_cgroup_charge(struct obj_cgroup *objcg, gfp_t gfp, size_t size);
void obj_cgroup_uncharge(struct obj_cgroup *objcg, size_t size);
struct mem_cgroup *obj_cgroup_memcg(struct obj_cgroup *objcg);
struct obj_cgroup *get_obj_cgroup_from_current(void);
Object cgroup is basically a pointer to a memory cgroup with a per-cpu
reference counter. It substitutes a memory cgroup in places where it's
necessary to charge a custom amount of bytes instead of pages.
All charged memory rounded down to pages is charged to the corresponding
memory cgroup using __memcg_kmem_charge().
It implements reparenting: on memcg offlining it gets reattached to the
parent memory cgroup. Each online memory cgroup has an associated active
object cgroup to handle new allocations, plus a list of all attached
object cgroups. On offlining of a cgroup this list is reparented: for
each object cgroup in the list the memcg pointer is swapped to the parent
memory cgroup. This prevents long-living objects from pinning the
original memory cgroup in memory.
The implementation is based on byte-sized per-cpu stocks. A sub-page
sized leftover is stored in an atomic field, which is part of the
obj_cgroup object. So on cgroup offlining the leftover is automatically
reparented.
memcg->objcg is rcu protected. objcg->memcg is a raw pointer, which is
always pointing at a memory cgroup, but can be atomically swapped to the
parent memory cgroup. So a user must ensure the lifetime of the
cgroup, e.g. grab rcu_read_lock or css_set_lock.
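A typical usage pattern of the API looks roughly like this (an
illustrative sketch only: charge `size` bytes to the current task's cgroup
and release the charge on the free path):
  struct obj_cgroup *objcg;

  objcg = get_obj_cgroup_from_current();  /* takes a reference, may be NULL */
  if (objcg && obj_cgroup_charge(objcg, GFP_KERNEL, size)) {
          obj_cgroup_put(objcg);
          objcg = NULL;                   /* charge failed: -ENOMEM */
  }

  /* ... later, on the free path, if the object was charged ... */
  if (objcg) {
          obj_cgroup_uncharge(objcg, size);
          obj_cgroup_put(objcg);
  }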
Suggested-by: Johannes Weiner <[email protected]>
Signed-off-by: Roman Gushchin <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Reviewed-by: Shakeel Butt <[email protected]>
Cc: Christoph Lameter <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Tejun Heo <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Linus Torvalds <[email protected]>
|
|
The reference counting of a memcg is currently coupled directly to how
many 4k pages are charged to it. This doesn't work well with Roman's new
slab controller, which maintains pools of objects and doesn't want to keep
an extra balance sheet for the pages backing those objects.
This unusual refcounting design (reference counts usually track pointers
to an object) is only for historical reasons: memcg used to not take any
css references and simply stalled offlining until all charges had been
reparented and the page counters had dropped to zero. When we got rid of
the reparenting requirement, the simple mechanical translation was to take
a reference for every charge.
More historical context can be found in commit e8ea14cc6ead ("mm:
memcontrol: take a css reference for each charged page"), commit
64f219938941 ("mm: memcontrol: remove obsolete kmemcg pinning tricks") and
commit b2052564e66d ("mm: memcontrol: continue cache reclaim from offlined
groups").
The new slab controller exposes the limitations in this scheme, so let's
switch it to a more idiomatic reference counting model based on actual
kernel pointers to the memcg:
- The per-cpu stock holds a reference to the memcg it's caching
- User pages hold a reference for their page->mem_cgroup. Transparent
huge pages will no longer acquire tail references in advance, we'll
get them if needed during the split.
- Kernel pages hold a reference for their page->mem_cgroup
- Pages allocated in the root cgroup will acquire and release css
references for simplicity. css_get() and css_put() optimize that.
- The current memcg_charge_slab() already hacked around the per-charge
references; this change gets rid of that as well.
- tcp accounting will handle reference in mem_cgroup_sk_{alloc,free}
Roman:
1) Rebased on top of the current mm tree: added css_get() in
mem_cgroup_charge(), dropped mem_cgroup_try_charge() part
2) I've reformatted commit references in the commit log to make
checkpatch.pl happy.
[[email protected]: remove css_put_many() from __mem_cgroup_clear_mc()]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Johannes Weiner <[email protected]>
Signed-off-by: Roman Gushchin <[email protected]>
Signed-off-by: Hugh Dickins <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Reviewed-by: Shakeel Butt <[email protected]>
Acked-by: Roman Gushchin <[email protected]>
Acked-by: Michal Hocko <[email protected]>
Cc: Christoph Lameter <[email protected]>
Cc: Tejun Heo <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Linus Torvalds <[email protected]>
|
|
This commit implements SLUB version of the obj_to_index() function, which
will be required to calculate the offset of obj_cgroup in the obj_cgroups
vector to store/obtain the objcg ownership data.
To make it faster, let's repeat the SLAB's trick introduced by commit
6a2d7a955d8d ("SLAB: use a multiply instead of a divide in
obj_to_index()") and avoid an expensive division.
Vlastimil Babka noticed that SLUB already has a similar function called
slab_index(), which is defined only if SLUB_DEBUG is enabled. The
function does similar math, but with a division, and it also takes a page
address instead of a page pointer.
Let's remove slab_index() and replace it with the new helper
__obj_to_index(), which takes a page address. obj_to_index() will be a
simple wrapper taking a page pointer and passing page_address(page) into
__obj_to_index().
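The new helpers look roughly like this (a sketch; KASAN tag handling is
omitted, and reciprocal_size is the field added to struct kmem_cache by
this patch):
  static inline unsigned int __obj_to_index(const struct kmem_cache *cache,
                                            void *addr, void *obj)
  {
          return reciprocal_divide(obj - addr, cache->reciprocal_size);
  }

  static inline unsigned int obj_to_index(const struct kmem_cache *cache,
                                          const struct page *page, void *obj)
  {
          return __obj_to_index(cache, page_address(page), obj);
  }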
Signed-off-by: Roman Gushchin <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Reviewed-by: Vlastimil Babka <[email protected]>
Reviewed-by: Shakeel Butt <[email protected]>
Acked-by: Johannes Weiner <[email protected]>
Cc: Christoph Lameter <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Tejun Heo <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Linus Torvalds <[email protected]>
|
|
In order to prepare for per-object slab memory accounting, convert
NR_SLAB_RECLAIMABLE and NR_SLAB_UNRECLAIMABLE vmstat items to bytes.
To make it obvious, rename them to NR_SLAB_RECLAIMABLE_B and
NR_SLAB_UNRECLAIMABLE_B (similar to NR_KERNEL_STACK_KB).
Internally, global and per-node counters are stored in pages, however
memcg and lruvec counters are stored in bytes. This scheme may look
weird, but only for now. As soon as slab pages are shared between
multiple cgroups, global and node counters will reflect the total number
of slab pages. However, memcg and lruvec counters will be used for
per-memcg slab memory tracking, which will take separate kernel objects
into account. Keeping global and node counters in pages helps to avoid
additional overhead.
The size of slab memory shouldn't exceed 4Gb on 32-bit machines, so it
will fit into the atomic_long_t we use for vmstats.
Signed-off-by: Roman Gushchin <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Reviewed-by: Shakeel Butt <[email protected]>
Acked-by: Johannes Weiner <[email protected]>
Acked-by: Vlastimil Babka <[email protected]>
Cc: Christoph Lameter <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Tejun Heo <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Linus Torvalds <[email protected]>
|
|
To implement per-object slab memory accounting, we need to convert slab
vmstat counters to bytes. Actually, out of the 4 levels of counters
(global, per-node, per-memcg and per-lruvec), only the last two will
require byte-sized counters. That's because global and per-node counters
will be counting the number of slab pages, while per-memcg and per-lruvec
counters will be counting the amount of memory taken by charged slab
objects.
Converting all vmstat counters to bytes, or even all slab counters to
bytes, would introduce additional overhead. So instead let's store global
and per-node counters in pages, and memcg and lruvec counters in bytes.
To make the API clean all access helpers (both on the read and write
sides) are dealing with bytes.
To avoid back-and-forth conversions a new flavor of read-side helpers is
introduced, which always returns values in pages: node_page_state_pages()
and global_node_page_state_pages().
Actually, the new helpers just read the raw values. The old helpers are
simple wrappers, which will complain on an attempt to read a byte-sized
value, because at the moment no one actually needs bytes.
Thanks to Johannes Weiner for the idea of having the byte-sized API on top
of the page-sized internal storage.
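The split can be illustrated with the global flavour (roughly as in
include/linux/vmstat.h in this series): the page-based wrapper warns if
asked for a byte-based item and otherwise defers to the raw _pages()
helper.
  static inline unsigned long global_node_page_state(enum node_stat_item item)
  {
          VM_WARN_ON_ONCE(vmstat_item_in_bytes(item));

          return global_node_page_state_pages(item);
  }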
Signed-off-by: Roman Gushchin <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Reviewed-by: Vlastimil Babka <[email protected]>
Reviewed-by: Shakeel Butt <[email protected]>
Acked-by: Johannes Weiner <[email protected]>
Cc: Christoph Lameter <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Tejun Heo <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Linus Torvalds <[email protected]>
|
|
__mod_lruvec_state()
Patch series "The new cgroup slab memory controller", v7.
The patchset moves the accounting from the page level to the object
level. It allows sharing slab pages between memory cgroups. This leads
to a significant win in slab utilization (up to 45%) and a corresponding
drop in the total kernel memory footprint. The reduced number of
unmovable slab pages should also have a positive effect on memory
fragmentation.
The patchset makes the slab accounting code simpler: there is no more
need for the complicated dynamic creation and destruction of per-cgroup
slab caches; all memory cgroups use a global set of shared slab caches.
The lifetime of slab caches is no longer tied to the lifetime of memory
cgroups.
The more precise accounting does require more CPU, however in practice
the difference seems to be negligible. We've been using the new slab
controller in Facebook production for several months with different
workloads and haven't seen any noticeable regressions. What we've seen
were memory savings on the order of 1 GB per host (it varied heavily
depending on the actual workload, size of RAM, number of CPUs, memory
pressure, etc).
The third version of the patchset added yet another step towards
simplification of the code: sharing of slab caches between accounted and
non-accounted allocations. It comes with significant upsides (most
noticeably, a complete elimination of dynamic slab cache creation) but
not without some regression risks, so this change sits on top of the
patchset and is not completely merged in. So in the unlikely event of a
noticeable performance regression it can be reverted separately.
The slab memory accounting works in exactly the same way for SLAB and
SLUB. With both allocators the new controller shows significant memory
savings, with SLUB the difference is bigger. On my 16-core desktop
machine running Fedora 32, the size of the slab memory measured after the
start of the system was lower by 58% and 38% with SLUB and SLAB
respectively.
As an estimate of the potential CPU overhead, below are the results of
the slab_bulk_test01 test, kindly provided by Jesper D. Brouer. He also
helped with the evaluation of the results.
The test can be found here: https://github.com/netoptimizer/prototype-kernel/
The smallest number in each row should be picked for a comparison.
SLUB-patched - bulk-API
- SLUB-patched : bulk_quick_reuse objects=1 : 187 - 90 - 224 cycles(tsc)
- SLUB-patched : bulk_quick_reuse objects=2 : 110 - 53 - 133 cycles(tsc)
- SLUB-patched : bulk_quick_reuse objects=3 : 88 - 95 - 42 cycles(tsc)
- SLUB-patched : bulk_quick_reuse objects=4 : 91 - 85 - 36 cycles(tsc)
- SLUB-patched : bulk_quick_reuse objects=8 : 32 - 66 - 32 cycles(tsc)
SLUB-original - bulk-API
- SLUB-original: bulk_quick_reuse objects=1 : 87 - 87 - 142 cycles(tsc)
- SLUB-original: bulk_quick_reuse objects=2 : 52 - 53 - 53 cycles(tsc)
- SLUB-original: bulk_quick_reuse objects=3 : 42 - 42 - 91 cycles(tsc)
- SLUB-original: bulk_quick_reuse objects=4 : 91 - 37 - 37 cycles(tsc)
- SLUB-original: bulk_quick_reuse objects=8 : 31 - 79 - 76 cycles(tsc)
SLAB-patched - bulk-API
- SLAB-patched : bulk_quick_reuse objects=1 : 67 - 67 - 140 cycles(tsc)
- SLAB-patched : bulk_quick_reuse objects=2 : 55 - 46 - 46 cycles(tsc)
- SLAB-patched : bulk_quick_reuse objects=3 : 93 - 94 - 39 cycles(tsc)
- SLAB-patched : bulk_quick_reuse objects=4 : 35 - 88 - 85 cycles(tsc)
- SLAB-patched : bulk_quick_reuse objects=8 : 30 - 30 - 30 cycles(tsc)
SLAB-original - bulk-API
- SLAB-original: bulk_quick_reuse objects=1 : 143 - 136 - 67 cycles(tsc)
- SLAB-original: bulk_quick_reuse objects=2 : 45 - 46 - 46 cycles(tsc)
- SLAB-original: bulk_quick_reuse objects=3 : 38 - 39 - 39 cycles(tsc)
- SLAB-original: bulk_quick_reuse objects=4 : 35 - 87 - 87 cycles(tsc)
- SLAB-original: bulk_quick_reuse objects=8 : 29 - 66 - 30 cycles(tsc)
This patch (of 19):
To convert memcg and lruvec slab counters to bytes there must be a way to
change these counters without touching node counters. Factor out
__mod_memcg_lruvec_state() out of __mod_lruvec_state().
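After the split, __mod_lruvec_state() looks roughly like this (simplified
from mm/memcontrol.c):
  void __mod_lruvec_state(struct lruvec *lruvec, enum node_stat_item idx,
                          int val)
  {
          /* The node counter is always updated ... */
          __mod_node_page_state(lruvec_pgdat(lruvec), idx, val);

          /* ... while the memcg/lruvec part now lives in its own helper. */
          if (!mem_cgroup_disabled())
                  __mod_memcg_lruvec_state(lruvec, idx, val);
  }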
Signed-off-by: Roman Gushchin <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Reviewed-by: Vlastimil Babka <[email protected]>
Reviewed-by: Shakeel Butt <[email protected]>
Acked-by: Johannes Weiner <[email protected]>
Cc: Christoph Lameter <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Tejun Heo <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Historically the kernel memory accounting was an opt-in feature, which
could be enabled for individual cgroups. But now that's not true: it's on
by default on both cgroup v1 and cgroup v2, and as long as a user has at
least one non-root memory cgroup, the kernel memory accounting is on. So
in most setups it's either always on (if memory cgroups are in use and
kmem accounting is not disabled) or always off (otherwise).
memcg_kmem_enabled() is used in many places to guard the kernel memory
accounting code. If memcg_kmem_enabled() can reverse from returning true
to returning false (as now), we can't rely on it on release paths and have
to check if it was on before.
If we make memcg_kmem_enabled() irreversible (always returning true after
it has returned true once), it'll make the general logic simpler and more
robust. It will also allow guarding some checks which would otherwise
stay unguarded.
Reported-by: Naresh Kamboju <[email protected]>
Signed-off-by: Roman Gushchin <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Tested-by: Naresh Kamboju <[email protected]>
Reviewed-by: Shakeel Butt <[email protected]>
Acked-by: Vlastimil Babka <[email protected]>
Acked-by: Michal Hocko <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Linus Torvalds <[email protected]>
|
|
The default is still set to inode32 for backwards compatibility, but
system administrators can opt in to the new 64-bit inode numbers by
either:
1. Passing inode64 on the command line when mounting, or
2. Configuring the kernel with CONFIG_TMPFS_INODE64=y
The inode64 and inode32 names are used based on existing precedent from
XFS.
[[email protected]: Kconfig fixes]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Chris Down <[email protected]>
Signed-off-by: Hugh Dickins <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Reviewed-by: Amir Goldstein <[email protected]>
Acked-by: Hugh Dickins <[email protected]>
Cc: Al Viro <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Jeff Layton <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Tejun Heo <[email protected]>
Link: http://lkml.kernel.org/r/8b23758d0c66b5e2263e08baf9c4b6a7565cbd8f.1594661218.git.chris@chrisdown.name
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Patch series "tmpfs: inode: Reduce risk of inum overflow", v7.
In Facebook production we are seeing heavy i_ino wraparounds on tmpfs. On
affected tiers, in excess of 10% of hosts show multiple files with
different content and the same inode number, with some servers even having
as many as 150 duplicated inode numbers with differing file content.
This causes actual, tangible problems in production. For example, we have
complaints from those working on remote caches that their application is
reporting cache corruptions because it uses (device, inodenum) to
establish the identity of a particular cache object, but because it's not
unique any more, the application refuses to continue and reports cache
corruption. Even worse, sometimes applications may not even detect the
corruption but may continue anyway, causing phantom and hard to debug
behaviour.
In general, userspace applications expect that (device, inodenum) should
be enough to uniquely point to one inode, which seems fair enough. One
might also need to check the generation, but in this case:
1. That's not currently exposed to userspace
(ioctl(...FS_IOC_GETVERSION...) returns ENOTTY on tmpfs);
2. Even with generation, there shouldn't be two live inodes with the
same inode number on one device.
In order to mitigate this, we take a two-pronged approach:
1. Moving inum generation from being global to per-sb for tmpfs. This
itself allows some reduction in i_ino churn. This works on both 64-
and 32- bit machines.
2. Adding inode{64,32} for tmpfs. This fix is supported on machines with
64-bit ino_t only: we allow users to mount tmpfs with a new inode64
option that uses the full width of ino_t, or CONFIG_TMPFS_INODE64.
You can see how this compares to previous related patches which didn't
implement this per-superblock:
- https://patchwork.kernel.org/patch/11254001/
- https://patchwork.kernel.org/patch/11023915/
This patch (of 2):
get_next_ino has a number of problems:
- It uses and returns a uint, which is susceptible to overflow if a lot
of volatile inodes that use get_next_ino are created.
- It's global, with no specificity per-sb or even per-filesystem. This
means it's not that difficult to cause inode number wraparounds on a
single device, which can result in having multiple distinct inodes
with the same inode number.
This patch adds a per-superblock counter that mitigates the second case.
This design also allows us to later have a specific i_ino size per-device,
for example, allowing users to choose whether to use 32- or 64-bit inodes
for each tmpfs mount. This is implemented in the next commit.
For internal shmem mounts which may be less tolerant to spinlock delays,
we implement a percpu batching scheme which only takes the stat_lock at
each batch boundary.
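The batching part looks roughly like this (a simplified sketch of the
kernel-internal mount path in shmem_reserve_inode(); the field and
constant names follow mm/shmem.c in this patch):
  ino_t ino, *next_ino;

  next_ino = per_cpu_ptr(sbinfo->ino_batch, get_cpu());
  ino = *next_ino;
  if (unlikely(ino % SHMEM_INO_BATCH == 0)) {
          /* Only refill from the shared counter at batch boundaries. */
          spin_lock(&sbinfo->stat_lock);
          ino = sbinfo->next_ino;
          sbinfo->next_ino += SHMEM_INO_BATCH;
          spin_unlock(&sbinfo->stat_lock);
  }
  *next_ino = ino + 1;
  put_cpu();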
Signed-off-by: Chris Down <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Acked-by: Hugh Dickins <[email protected]>
Cc: Amir Goldstein <[email protected]>
Cc: Al Viro <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Jeff Layton <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Tejun Heo <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Link: http://lkml.kernel.org/r/1986b9d63b986f08ec07a4aa4b2275e718e47d8a.1594661218.git.chris@chrisdown.name
Signed-off-by: Linus Torvalds <[email protected]>
|
|
swap_readpage() does synchronous IO for one page. The IO is not big and
normally it finishes quickly, but it may take a long time or wait forever
in case of IO failure or discard.
This patch uses blk_io_schedule() instead of io_schedule() to avoid task
hung reports and crashes (when /proc/sys/kernel/hung_task_panic is set)
when the above exceptions occur.
This is similar to the hung task avoidance in submit_bio_wait(),
blk_execute_rq() and __blkdev_direct_IO().
Signed-off-by: Xianting Tian <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Reviewed-by: Andrew Morton <[email protected]>
Cc: Ming Lei <[email protected]>
Cc: Bart Van Assche <[email protected]>
Cc: Hannes Reinecke <[email protected]>
Cc: Jens Axboe <[email protected]>
Cc: Hugh Dickins <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Fix W=1 compile warnings (invalid kerneldoc):
mm/swap_state.c:742: warning: Function parameter or member 'fentry' not described in 'swap_vma_readahead'
mm/swap_state.c:742: warning: Excess function parameter 'entry' description in 'swap_vma_readahead'
Signed-off-by: Krzysztof Kozlowski <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Reviewed-by: Andrew Morton <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Linus Torvalds <[email protected]>
|
|
swap_slot_cache_enabled can only become true in enable_swap_slots_cache(),
and that requires swap_slot_cache_initialized to be true beforehand. That
means that when swap_slot_cache_enabled is true,
swap_slot_cache_initialized is also true.
So the condition:
"swap_slot_cache_enabled && swap_slot_cache_initialized"
can be reduced to "swap_slot_cache_enabled"
And in mathematics:
"!swap_slot_cache_enabled || !swap_slot_cache_initialized"
is equal to "!(swap_slot_cache_enabled && swap_slot_cache_initialized)"
So no functional change.
Signed-off-by: Zhen Lei <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Acked-by: Tim Chen <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Whether swap_slot_cache_initialized is true or false,
__reenable_swap_slots_cache() is always called. To make this clear, leave
only one call to __reenable_swap_slots_cache(). This also makes it
clearer what extra work needs to be done when swap_slot_cache_initialized
is false.
No functional change.
Signed-off-by: Zhen Lei <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Acked-by: Tim Chen <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Patch series "clean up some functions in mm/swap_slots.c".
When I studied the code of mm/swap_slots.c, I found some places that can
be improved.
This patch (of 3):
Both "slots" and "slots_ret" are only need to be freed when cache already
allocated. Make them closer, seems more clear.
No functional change.
Signed-off-by: Zhen Lei <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Acked-by: Tim Chen <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Linus Torvalds <[email protected]>
|
|
The return value of populate_vma_page_range() is consistent with that of
__get_user_pages(), so make the function comment on the return value
consistent as well.
Signed-off-by: Tang Yizhou <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Reviewed-by: Andrew Morton <[email protected]>
Reviewed-by: Ira Weiny <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Linus Torvalds <[email protected]>
|
|
FGP_{WRITE|NOFS|NOWAIT} were missed in pagecache_get_page's kerneldoc
comment.
Signed-off-by: Yang Shi <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Cc: Gang Deng <[email protected]>
Cc: Shakeel Butt <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Rik van Riel <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Since commit bbddabe2e436aa ("mm: filemap: only do access activations on
reads"), mark_page_accessed() is called for reads only. But the idle flag
is cleared by mark_page_accessed() so the idle flag won't get cleared if
the page is write accessed only.
Basically, idle page tracking is used to estimate the workingset size of
a workload, and a noticeable part of the workingset might be missed if
the idle flag is not maintained correctly.
It seems good enough to just clear the idle flag for write operations.
Fixes: bbddabe2e436 ("mm: filemap: only do access activations on reads")
Reported-by: Gang Deng <[email protected]>
Signed-off-by: Yang Shi <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Reviewed-by: Shakeel Butt <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Rik van Riel <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Linus Torvalds <[email protected]>
|
|
If a compound page is being split while dump_page() is being run on that
page, we can end up calling compound_mapcount() on a page that is no
longer compound. This leads to a crash (already seen at least once in the
field), due to the VM_BUG_ON_PAGE() assertion inside compound_mapcount().
(The above is from Matthew Wilcox's analysis of Qian Cai's bug report.)
A similar problem is possible, via compound_pincount() instead of
compound_mapcount().
In order to avoid this kind of crash, make dump_page() slightly more
robust, by providing a pair of simpler routines that don't contain
assertions: head_mapcount() and head_pincount().
For debug tools, we don't want to go *too* far in this direction, but this
is a simple small fix, and the crash has already been seen, so it's a good
trade-off.
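The new helpers are deliberately free of assertions and look roughly like
this (callers are expected to pass a head page):
  static inline int head_mapcount(struct page *head)
  {
          return atomic_read(compound_mapcount_ptr(head)) + 1;
  }

  static inline int head_pincount(struct page *head)
  {
          return atomic_read(compound_pincount_ptr(head));
  }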
Reported-by: Qian Cai <[email protected]>
Suggested-by: Matthew Wilcox <[email protected]>
Signed-off-by: John Hubbard <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Acked-by: Vlastimil Babka <[email protected]>
Cc: Kirill A. Shutemov <[email protected]>
Cc: Mike Rapoport <[email protected]>
Cc: William Kucharski <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Linus Torvalds <[email protected]>
|
|
The actual address of the struct page isn't particularly helpful, while
the hashed address helps match with other messages elsewhere. Add the PFN
that the page refers to in order to help diagnose problems where the page
is improperly aligned for the purpose.
Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Reviewed-by: John Hubbard <[email protected]>
Acked-by: Mike Rapoport <[email protected]>
Cc: "Kirill A. Shutemov" <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: William Kucharski <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Linus Torvalds <[email protected]>
|
|
The inode number helps correlate this page with debug messages elsewhere
in the kernel.
Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Reviewed-by: John Hubbard <[email protected]>
Acked-by: Mike Rapoport <[email protected]>
Cc: "Kirill A. Shutemov" <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: William Kucharski <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Linus Torvalds <[email protected]>
|
|
This is simpler to use than copy_from_kernel_nofault(). Also make some of
the related error messages less verbose.
Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Acked-by: Mike Rapoport <[email protected]>
Acked-by: Vlastimil Babka <[email protected]>
Cc: John Hubbard <[email protected]>
Cc: "Kirill A. Shutemov" <[email protected]>
Cc: William Kucharski <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Tail page flags contain very little useful information. Print the head
page's flags instead. While the flags will contain "head" for tail pages,
this should not be too confusing as the previous line starts with the word
"head:" and so the flags should be interpreted as belonging to the head
page.
Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Reviewed-by: John Hubbard <[email protected]>
Acked-by: Mike Rapoport <[email protected]>
Acked-by: Vlastimil Babka <[email protected]>
Cc: "Kirill A. Shutemov" <[email protected]>
Cc: William Kucharski <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Simplify both the implementation and the output by splitting all the
compound page information onto a second line.
Reported-by: John Hubbard <[email protected]>
Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Tested-by: John Hubbard <[email protected]>
Reviewed-by: John Hubbard <[email protected]>
Acked-by: Mike Rapoport <[email protected]>
Acked-by: Vlastimil Babka <[email protected]>
Cc: "Kirill A. Shutemov" <[email protected]>
Cc: William Kucharski <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Patch series "Improvements for dump_page()", v2.
Here's a sample dump of a pagecache tail page with all of the patches
applied:
page:000000006d1c49ca refcount:6 mapcount:0 mapping:00000000136b8d90 index:0x109 pfn:0x6c645
head:000000008bd38076 order:2 compound_mapcount:0 compound_pincount:0
aops:xfs_address_space_operations ino:800042 dentry name:"fd"
flags: 0x4000000000012014(uptodate|lru|private|head)
raw: 4000000000000000 ffffd46ac1b19101 ffffffff00000202 dead000000000004
raw: 0000000000000001 0000000000000000 00000000ffffffff 0000000000000000
head: 4000000000012014 ffffd46ac1b1bbc8 ffffd46ac1b1bc08 ffff91976f659560
head: 0000000000000108 ffff919773220680 00000006ffffffff 0000000000000000
page dumped because: testing
This patch (of 6):
If we can't call page_mapping() to get the page mapping, handle the
anon/ksm/movable bits correctly.
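A sketch of classifying the raw mapping pointer by its low bits when
page_mapping() cannot safely be used (hypothetical helper built on the
standard PAGE_MAPPING_* definitions):
static const char *mapping_type_sketch(struct page *page)
{
	unsigned long bits = (unsigned long)page->mapping;
	if (bits & PAGE_MAPPING_ANON)
		return (bits & PAGE_MAPPING_FLAGS) == PAGE_MAPPING_KSM ?
			"ksm" : "anon";
	if (bits & PAGE_MAPPING_MOVABLE)
		return "movable";
	return "";
}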
[[email protected]: augmented code comment from John]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Reviewed-by: John Hubbard <[email protected]>
Acked-by: Mike Rapoport <[email protected]>
Acked-by: Vlastimil Babka <[email protected]>
Cc: William Kucharski <[email protected]>
Cc: "Kirill A. Shutemov" <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Linus Torvalds <[email protected]>
|
|
This adds a description file for all arch page table helpers, in sync with
the semantics being tested via CONFIG_DEBUG_VM_PGTABLE. Any future change
to either these descriptions or the debug test should keep the two in
sync.
[[email protected]: fold in Mike's patch for the rst document, fix typos in the rst document]
Link: http://lkml.kernel.org/r/[email protected]
Suggested-by: Mike Rapoport <[email protected]>
Signed-off-by: Anshuman Khandual <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Acked-by: Mike Rapoport <[email protected]>
Cc: Jonathan Corbet <[email protected]>
Cc: Mike Rapoport <[email protected]>
Cc: Vineet Gupta <[email protected]>
Cc: Catalin Marinas <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: Benjamin Herrenschmidt <[email protected]>
Cc: Paul Mackerras <[email protected]>
Cc: Michael Ellerman <[email protected]>
Cc: Heiko Carstens <[email protected]>
Cc: Vasily Gorbik <[email protected]>
Cc: Christian Borntraeger <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: "H. Peter Anvin" <[email protected]>
Cc: Kirill A. Shutemov <[email protected]>
Cc: Paul Walmsley <[email protected]>
Cc: Palmer Dabbelt <[email protected]>
Cc: Zi Yan <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Linus Torvalds <[email protected]>
|
|
This adds debug print information that lists all the tests executed on a
given platform. With dynamic debug enabled, the following information is
printed during boot; a sketch of how each test announces itself follows
the sample output below. For compactness, both the timestamp and the
prefix (i.e. debug_vm_pgtable) have been dropped from this sample output.
[debug_vm_pgtable ]: Validating architecture page table helpers
[pte_basic_tests ]: Validating PTE basic
[pmd_basic_tests ]: Validating PMD basic
[p4d_basic_tests ]: Validating P4D basic
[pgd_basic_tests ]: Validating PGD basic
[pte_clear_tests ]: Validating PTE clear
[pmd_clear_tests ]: Validating PMD clear
[pte_advanced_tests ]: Validating PTE advanced
[pmd_advanced_tests ]: Validating PMD advanced
[hugetlb_advanced_tests]: Validating HugeTLB advanced
[pmd_leaf_tests ]: Validating PMD leaf
[pmd_huge_tests ]: Validating PMD huge
[pte_savedwrite_tests ]: Validating PTE saved write
[pmd_savedwrite_tests ]: Validating PMD saved write
[pmd_populate_tests ]: Validating PMD populate
[pte_special_tests ]: Validating PTE special
[pte_protnone_tests ]: Validating PTE protnone
[pmd_protnone_tests ]: Validating PMD protnone
[pte_devmap_tests ]: Validating PTE devmap
[pmd_devmap_tests ]: Validating PMD devmap
[pte_swap_tests ]: Validating PTE swap
[swap_migration_tests ]: Validating swap migration
[hugetlb_basic_tests ]: Validating HugeTLB basic
[pmd_thp_tests ]: Validating PMD based THP
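A sketch of how each test can announce itself this way, assuming a
pr_fmt() that injects the function name (a common kernel pattern, not
necessarily the exact macro used here):
/* at the top of mm/debug_vm_pgtable.c, before any #include */
#define pr_fmt(fmt) "debug_vm_pgtable: [%-25s]: " fmt, __func__
static void __init pte_basic_tests(unsigned long pfn, pgprot_t prot)
{
	pte_t pte = pfn_pte(pfn, prot);
	pr_debug("Validating PTE basic\n");
	WARN_ON(!pte_same(pte, pte));
	WARN_ON(!pte_young(pte_mkyoung(pte_mkold(pte))));
	WARN_ON(!pte_dirty(pte_mkdirty(pte_mkclean(pte))));
}
With CONFIG_DYNAMIC_DEBUG enabled, each pr_debug() line then appears in
the boot log as one entry, prefixed by its function name as in the sample
above.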
Signed-off-by: Anshuman Khandual <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Tested-by: Vineet Gupta <[email protected]> [arc]
Reviewed-by: Zi Yan <[email protected]>
Cc: Catalin Marinas <[email protected]>
Cc: Mike Rapoport <[email protected]>
Cc: Vineet Gupta <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: Benjamin Herrenschmidt <[email protected]>
Cc: Paul Mackerras <[email protected]>
Cc: Michael Ellerman <[email protected]>
Cc: Heiko Carstens <[email protected]>
Cc: Vasily Gorbik <[email protected]>
Cc: Christian Borntraeger <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: "H. Peter Anvin" <[email protected]>
Cc: Kirill A. Shutemov <[email protected]>
Cc: Paul Walmsley <[email protected]>
Cc: Palmer Dabbelt <[email protected]>
Cc: Jonathan Corbet <[email protected]>
Cc: Mike Rapoport <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Linus Torvalds <[email protected]>
|
|
This adds new tests validating the following arch advanced page table
helpers. These tests create and test specific mapping types at various
page table levels; a minimal sketch of one such test follows the list
below.
1. pxxp_set_wrprotect()
2. pxxp_get_and_clear()
3. pxxp_set_access_flags()
4. pxxp_get_and_clear_full()
5. pxxp_test_and_clear_young()
6. pxx_leaf()
7. pxx_set_huge()
8. pxx_(clear|mk)_savedwrite()
9. huge_pxxp_xxx()
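As a minimal sketch of the shape these tests take (simplified and
hypothetical; the real tests exercise more transitions and more levels):
static void __init pte_advanced_tests(struct mm_struct *mm,
		struct vm_area_struct *vma, pte_t *ptep,
		unsigned long pfn, unsigned long vaddr, pgprot_t prot)
{
	pte_t pte = pfn_pte(pfn, prot);
	set_pte_at(mm, vaddr, ptep, pte);
	ptep_set_wrprotect(mm, vaddr, ptep);
	WARN_ON(pte_write(*ptep));
	ptep_get_and_clear(mm, vaddr, ptep);
	WARN_ON(!pte_none(*ptep));
	set_pte_at(mm, vaddr, ptep, pte_mkyoung(pte));
	ptep_test_and_clear_young(vma, vaddr, ptep);
	WARN_ON(pte_young(*ptep));
}
Each helper is run against a known starting PTE and the result is checked
with WARN_ON(), so an architecture that deviates from the documented
semantics is flagged at boot.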
[[email protected]: drop RANDOM_ORVALUE from hugetlb_advanced_tests()]
Link: http://lkml.kernel.org/r/[email protected]
Suggested-by: Catalin Marinas <[email protected]>
Signed-off-by: Anshuman Khandual <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Tested-by: Vineet Gupta <[email protected]> [arc]
Reviewed-by: Zi Yan <[email protected]>
Cc: Mike Rapoport <[email protected]>
Cc: Vineet Gupta <[email protected]>
Cc: Catalin Marinas <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: Benjamin Herrenschmidt <[email protected]>
Cc: Paul Mackerras <[email protected]>
Cc: Michael Ellerman <[email protected]>
Cc: Heiko Carstens <[email protected]>
Cc: Vasily Gorbik <[email protected]>
Cc: Christian Borntraeger <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: "H. Peter Anvin" <[email protected]>
Cc: Kirill A. Shutemov <[email protected]>
Cc: Paul Walmsley <[email protected]>
Cc: Palmer Dabbelt <[email protected]>
Cc: Jonathan Corbet <[email protected]>
Cc: Mike Rapoport <[email protected]>
Cc: Steven Price <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Patch series "mm/debug_vm_pgtable: Add some more tests", v5.
This series adds some more arch page table helper validation tests which
are related to core and advanced memory functions. This also adds
documentation listing the expected semantics of all page table helpers, as
suggested by Mike Rapoport previously (https://lkml.org/lkml/2020/1/30/40).
There are many TRANSPARENT_HUGEPAGE and ARCH_HAS_TRANSPARENT_HUGEPAGE_PUD
ifdefs scattered across the test, but consolidating all the fallback stubs
is not straightforward because
ARCH_HAS_TRANSPARENT_HUGEPAGE_PUD is not explicitly dependent on
ARCH_HAS_TRANSPARENT_HUGEPAGE.
Tested on arm64 and x86; only build tested on all other platforms that
enable ARCH_HAS_DEBUG_VM_PGTABLE, i.e. powerpc, arc and s390. The
following failure on arm64, mentioned previously, still exists; it will be
fixed by the upcoming THP migration enablement series for arm64.
WARNING .... mm/debug_vm_pgtable.c:860 debug_vm_pgtable+0x940/0xa54
WARN_ON(!pmd_present(pmd_mkinvalid(pmd_mkhuge(pmd))))
This patch (of 4):
This adds new tests validating arch page table helpers for the following
core memory features. These tests create and test specific mapping types
at various page table levels; a minimal sketch of one such test follows
the list below.
1. SPECIAL mapping
2. PROTNONE mapping
3. DEVMAP mapping
4. SOFTDIRTY mapping
5. SWAP mapping
6. MIGRATION mapping
7. HUGETLB mapping
8. THP mapping
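A minimal sketch of one of these, the SPECIAL mapping check (the real
tests guard each feature on its CONFIG option and cover all of the mapping
types listed above):
static void __init pte_special_tests(unsigned long pfn, pgprot_t prot)
{
	pte_t pte = pfn_pte(pfn, prot);
	if (!IS_ENABLED(CONFIG_ARCH_HAS_PTE_SPECIAL))
		return;
	WARN_ON(!pte_special(pte_mkspecial(pte)));
}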
Suggested-by: Catalin Marinas <[email protected]>
Signed-off-by: Anshuman Khandual <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Tested-by: Vineet Gupta <[email protected]> [arc]
Reviewed-by: Zi Yan <[email protected]>
Cc: Mike Rapoport <[email protected]>
Cc: Vineet Gupta <[email protected]>
Cc: Catalin Marinas <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: Benjamin Herrenschmidt <[email protected]>
Cc: Paul Mackerras <[email protected]>
Cc: Michael Ellerman <[email protected]>
Cc: Heiko Carstens <[email protected]>
Cc: Vasily Gorbik <[email protected]>
Cc: Christian Borntraeger <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: "H. Peter Anvin" <[email protected]>
Cc: Kirill A. Shutemov <[email protected]>
Cc: Paul Walmsley <[email protected]>
Cc: Palmer Dabbelt <[email protected]>
Cc: Jonathan Corbet <[email protected]>
Cc: Mike Rapoport <[email protected]>
Cc: Steven Price <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Link: http://lkml.kernel.org/r/[email protected]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Provide the necessary KCSAN checks to assist with debugging racy
use-after-frees. While KASAN is more reliable at generally catching such
use-after-frees (due to its use of a quarantine), it can be difficult to
debug racy use-after-frees. If a reliable reproducer exists, KCSAN can
assist in debugging such issues.
Note: ASSERT_EXCLUSIVE_ACCESS() is a convenience wrapper that only covers
an access of size sizeof(var). Here __kcsan_check_access() is used
explicitly instead, so that the correct object size can be passed.
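A sketch of the explicit check (the helper name is hypothetical, and
s->object_size stands in for whatever size the allocator tracks for the
object being freed):
#include <linux/kcsan-checks.h>
static inline void check_exclusive_on_free(struct kmem_cache *s, void *x)
{
	/*
	 * ASSERT_EXCLUSIVE_ACCESS(var) covers only sizeof(var) bytes;
	 * pass the full tracked object size explicitly instead.
	 */
	__kcsan_check_access(x, s->object_size,
			     KCSAN_ACCESS_WRITE | KCSAN_ACCESS_ASSERT);
}
If another thread touches any byte of the object concurrently with the
free, KCSAN reports the race, which is exactly the racy use-after-free
being hunted here.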
Signed-off-by: Marco Elver <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Cc: Alexander Potapenko <[email protected]>
Cc: Andrey Konovalov <[email protected]>
Cc: Christoph Lameter <[email protected]>
Cc: Pekka Enberg <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: Joonsoo Kim <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Linus Torvalds <[email protected]>
|