Age | Commit message (Collapse) | Author | Files | Lines |
|
debugfs_remove_recursive
Fix checkpatch warning:
"WARNING: debugfs_remove_recursive(NULL) is safe this check is probably not required"
Signed-off-by: Fabian Frederick <[email protected]>
Acked-by: Naoya Horiguchi <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Historically, we exported shared pages to userspace via sysinfo(2)
sharedram and /proc/meminfo's "MemShared" fields. With the advent of
tmpfs, from kernel v2.4 onward, that old way for accounting shared mem
was deemed inaccurate and we started to export a hard-coded 0 for
sysinfo.sharedram. Later on, during the 2.6 timeframe, "MemShared" got
re-introduced to /proc/meminfo re-branded as "Shmem", but we're still
reporting sysinfo.sharedmem as that old hard-coded zero, which makes the
"shared memory" report inconsistent across interfaces.
This patch leverages the addition of explicit accounting for pages used
by shmem/tmpfs -- "4b02108 mm: oom analysis: add shmem vmstat" -- in
order to make the users of sysinfo(2) and si_meminfo*() friends aware of
that vmstat entry and make them report it consistently across the
interfaces, as well to make sysinfo(2) returned data consistent with our
current API documentation states.
Signed-off-by: Rafael Aquini <[email protected]>
Acked-by: Rik van Riel <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: KOSAKI Motohiro <[email protected]>
Cc: Hugh Dickins <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Print a warning (if CONFIG_DEBUG_VM=y) when memory commitment becomes
too negative.
This shouldn't happen any more - the previous two patches fixed the
committed_as underflow issues.
[[email protected]: use VM_WARN_ONCE, per Dave]
Signed-off-by: Konstantin Khlebnikov <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Dave Hansen <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
A shared anonymous mapping created without MAP_NORESERVE holds memory
reservation for whole range of shmem segment. Usually there is no way
to change its size, but /proc/<pid>/map_files/... (available if
CONFIG_CHECKPOINT_RESTORE=y) allows that.
This patch adjusts the memory reservation in shmem_setattr().
Signed-off-by: Konstantin Khlebnikov <[email protected]>
Acked-by: Hugh Dickins <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
If __shmem_file_setup() fails on struct file allocation it uncharges
memory commitment twice: first by shmem_unacct_size() and second time
implicitly in shmem_evict_inode() when it kills the newly created inode.
This patch removes shmem_unacct_size() from error path if the inode was
already there.
Signed-off-by: Konstantin Khlebnikov <[email protected]>
Acked-by: Hugh Dickins <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
It was missing...
Cc: Konstantin Khlebnikov <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Hugh Dickins <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
tmp_mask in the __vmalloc_area_node() iteration never changes so it can
be moved into function scope and marked with const. This causes the
movl and orl to only be done once per call rather than area->nr_pages
times.
nested_gfp can also be marked const.
Signed-off-by: David Rientjes <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
It is not uncommon on busy servers to get stuck hundred of ms in
vmalloc() calls (like file descriptor expansions).
Add a cond_resched() to __vmalloc_area_node() to be gentle to
other tasks.
[[email protected]: only do it for __GFP_WAIT, per David]
Signed-off-by: Eric Dumazet <[email protected]>
Cc: Hugh Dickins <[email protected]>
Acked-by: David Rientjes <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Currently, we have more filesystems supporting fallocate, e.g
ext4/btrfs. Remove the outdated comment for madvise_remove.
Signed-off-by: Wang Sheng-Hui <[email protected]>
Reviewed-by: Naoya Horiguchi <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
The mm_migrate_pages trace event reports a reason for the migration,
typically as a symbolic string. The exception is the reason
MR_NUMA_MISPLACED for which it just displays the numeric value:
mm_migrate_pages: nr_succeeded=1 nr_failed=0 mode=MIGRATE_ASYNC
reason=0x5
This patch makes the output consistent by introducing a string value for
MR_NUMA_MISPLACED. The event is then reported as: mm_migrate_pages:
nr_succeeded=1 nr_failed=0 mode=MIGRATE_ASYNC reason=numa_misplaced
Signed-off-by: Max Asbock <[email protected]>
Acked-by: Steven Rostedt <[email protected]>
Cc: Ingo Molnar <[email protected]>
Acked-by: Mel Gorman <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Reorder the members by input and output, then turn the individual
integers for may_writepage, may_unmap, may_swap, compaction_ready,
hibernation_mode into bit fields to save stack space:
+72/-296 -224
kswapd 104 176 +72
try_to_free_pages 80 56 -24
try_to_free_mem_cgroup_pages 80 56 -24
shrink_all_memory 88 64 -24
reclaim_clean_pages_from_list 168 144 -24
mem_cgroup_shrink_node_zone 104 80 -24
__zone_reclaim 176 152 -24
balance_pgdat 152 - -152
Signed-off-by: Johannes Weiner <[email protected]>
Suggested-by: Mel Gorman <[email protected]>
Acked-by: Mel Gorman <[email protected]>
Acked-by: Michal Hocko <[email protected]>
Cc: Minchan Kim <[email protected]>
Cc: Rik van Riel <[email protected]>
Acked-by: Hugh Dickins <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Swappiness is determined for each scanned memcg individually in
shrink_zone() and is not a parameter that applies throughout the reclaim
scan. Move it out of struct scan_control to prevent accidental use of a
stale value.
Signed-off-by: Johannes Weiner <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Rik van Riel <[email protected]>
Acked-by: Michal Hocko <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Acked-by: Minchan Kim <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Direct reclaim currently calls shrink_zones() to reclaim all members of
a zonelist, and if that wasn't successful it does another pass through
the same zonelist to check overall reclaimability.
Just check reclaimability in shrink_zones() directly and propagate the
result through the return value. Then remove all_unreclaimable().
Signed-off-by: Johannes Weiner <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Rik van Riel <[email protected]>
Acked-by: Michal Hocko <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Acked-by: Minchan Kim <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Page reclaim for a higher-order page runs until compaction is ready,
then aborts and signals this situation through the return value of
shrink_zones(). This is an oddly specific signal to encode in the
return value of shrink_zones(), though, and can be quite confusing.
Introduce sc->compaction_ready and signal the compactability of the
zones out-of-band to free up the return value of shrink_zones() for
actual zone reclaimability.
Signed-off-by: Johannes Weiner <[email protected]>
Acked-by: Vlastimil Babka <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: Michal Hocko <[email protected]>
Acked-by: Minchan Kim <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
shrink_zones() has a special branch to skip the all_unreclaimable()
check during hibernation, because a frozen kswapd can't mark a zone
unreclaimable.
But ever since commit 6e543d5780e3 ("mm: vmscan: fix
do_try_to_free_pages() livelock"), determining a zone to be
unreclaimable is done by directly looking at its scan history and no
longer relies on kswapd setting the per-zone flag.
Remove this branch and let shrink_zones() check the reclaimability of
the target zones regardless of hibernation state.
Signed-off-by: Johannes Weiner <[email protected]>
Acked-by: Mel Gorman <[email protected]>
Cc: Rik van Riel <[email protected]>
Acked-by: Michal Hocko <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Acked-by: Minchan Kim <[email protected]>
Cc: KOSAKI Motohiro <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
In original code, zone_movable_is_highmem() assumes ZONE_MOVABLE not
highmem if CONFIG_HAVE_MEMBLOCK_NODE_MAP is not set. In online_pages,
it extracts pages from the previous zone before ZONE_MOVABLE. Which is
logically inconsistent:
If HAVE_MEMBLOCK_NODE_MAP is turned off but HIGHMEM is on,
zone_movable_is_highmem() makes movable zone not highmem, but
online_pages() extracts pages from ZONE_HIGHMEM.
This inconsistency doesn't cause real problem currently, because all
architectures support online_pages also have HAVE_MEMBLOCK_NODE_MAP.
However, fixing it makes code clear, and also helps futher coding.
Signed-off-by: Wang Nan <[email protected]>
Cc: Zhang Zhen <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Jiang Liu <[email protected]>
Cc: Li Zefan <[email protected]>
Cc: Yinghai Lu <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Use the newer and more pleasant kstrtoull() to replace
simple_strtoull(), because simple_strtoull() is marked for obsoletion.
Signed-off-by: Zhang Zhen <[email protected]>
Acked-by: David Rientjes <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Kmem page charging and uncharging is serialized by means of exclusive
access to the page. Do not take the page_cgroup lock and don't set
pc->flags atomically.
Signed-off-by: Johannes Weiner <[email protected]>
Acked-by: Michal Hocko <[email protected]>
Acked-by: Vladimir Davydov <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Tejun Heo <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
There is a write barrier between setting pc->mem_cgroup and
PageCgroupUsed, which was added to allow LRU operations to lookup the
memcg LRU list of a page without acquiring the page_cgroup lock.
But ever since commit 38c5d72f3ebe ("memcg: simplify LRU handling by new
rule"), pages are ensured to be off-LRU while charging, so nobody else
is changing LRU state while pc->mem_cgroup is being written, and there
are no read barriers anymore.
Remove the unnecessary write barrier.
Signed-off-by: Johannes Weiner <[email protected]>
Acked-by: Michal Hocko <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Tejun Heo <[email protected]>
Cc: Vladimir Davydov <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Due to an old optimization to keep expensive res_counter changes at a
minimum, the root_mem_cgroup res_counter is never charged; there is no
limit at that level anyway, and any statistics can be generated on
demand by summing up the counters of all other cgroups.
However, with per-cpu charge caches, res_counter operations do not even
show up in profiles anymore, so this optimization is no longer
necessary.
Remove it to simplify the code.
Signed-off-by: Johannes Weiner <[email protected]>
Acked-by: Michal Hocko <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Tejun Heo <[email protected]>
Cc: Vladimir Davydov <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
When mem_cgroup_try_charge() returns -EINTR, it bypassed the charge to
the root memcg. But move precharging does not catch this and treats
this case as if no charge had happened, thus leaking a charge against
root. Because of an old optimization, the root memcg's res_counter is
not actually charged right now, but it's still an imbalance and
subsequent patches will charge the root memcg again.
Catch those bypasses to the root memcg and properly cancel them before
giving up the move.
Signed-off-by: Johannes Weiner <[email protected]>
Acked-by: Michal Hocko <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Tejun Heo <[email protected]>
Cc: Vladimir Davydov <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
The move precharge function does some baroque things: it tries raw
res_counter charging of the entire amount first, and then falls back to
a loop of one-by-one charges, with checks for pending signals and
cond_resched() batching.
Just use mem_cgroup_try_charge() without __GFP_WAIT for the first bulk
charge attempt. In the one-by-one loop, remove the signal check (this
is already checked in try_charge), and simply call cond_resched() after
every charge - it's not that expensive.
Signed-off-by: Johannes Weiner <[email protected]>
Acked-by: Michal Hocko <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Tejun Heo <[email protected]>
Cc: Vladimir Davydov <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
For the page allocator, __GFP_NORETRY implies that no OOM should be
triggered, whereas memcg has an explicit parameter to disable OOM.
The only callsites that want OOM disabled are THP charges and charge
moving. THP already uses __GFP_NORETRY and charge moving can use it as
well - one full reclaim cycle should be plenty. Switch it over, then
remove the OOM parameter.
Signed-off-by: Johannes Weiner <[email protected]>
Signed-off-by: Michal Hocko <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Tejun Heo <[email protected]>
Cc: Vladimir Davydov <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
There is no reason why oom-disabled and __GFP_NOFAIL charges should try
to reclaim only once when every other charge tries several times before
giving up. Make them all retry the same number of times.
Signed-off-by: Johannes Weiner <[email protected]>
Acked-by: Michal Hocko <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Tejun Heo <[email protected]>
Cc: Vladimir Davydov <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Transparent huge page charges prefer falling back to regular pages
rather than spending a lot of time in direct reclaim.
Desired reclaim behavior is usually declared in the gfp mask, but THP
charges use GFP_KERNEL and then rely on the fact that OOM is disabled
for THP charges, and that OOM-disabled charges don't retry reclaim.
Needless to say, this is anything but obvious and quite error prone.
Convert THP charges to use GFP_TRANSHUGE instead, which implies
__GFP_NORETRY, to indicate the low-latency requirement.
Signed-off-by: Johannes Weiner <[email protected]>
Acked-by: Michal Hocko <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Tejun Heo <[email protected]>
Cc: Vladimir Davydov <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Currently, __GFP_NORETRY tries charging once and gives up before even
trying to reclaim. Bring the behavior on par with the page allocator
and reclaim at least once before giving up.
Signed-off-by: Johannes Weiner <[email protected]>
Acked-by: Michal Hocko <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Tejun Heo <[email protected]>
Cc: Vladimir Davydov <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
The charging path currently starts out with OOM condition checks when
OOM is the rarest possible case.
Rearrange this code to run OOM/task dying checks only after trying the
percpu charge and the res_counter charge and bail out before entering
reclaim. Attempting a charge does not hurt an (oom-)killed task as much
as every charge attempt having to check OOM conditions. Also, only
check __GFP_NOFAIL when the charge would actually fail.
Signed-off-by: Johannes Weiner <[email protected]>
Acked-by: Michal Hocko <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Tejun Heo <[email protected]>
Cc: Vladimir Davydov <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
These patches rework memcg charge lifetime to integrate more naturally
with the lifetime of user pages. This drastically simplifies the code
and reduces charging and uncharging overhead. The most expensive part
of charging and uncharging is the page_cgroup bit spinlock, which is
removed entirely after this series.
Here are the top-10 profile entries of a stress test that reads a 128G
sparse file on a freshly booted box, without even a dedicated cgroup
(i.e. executing in the root memcg). Before:
15.36% cat [kernel.kallsyms] [k] copy_user_generic_string
13.31% cat [kernel.kallsyms] [k] memset
11.48% cat [kernel.kallsyms] [k] do_mpage_readpage
4.23% cat [kernel.kallsyms] [k] get_page_from_freelist
2.38% cat [kernel.kallsyms] [k] put_page
2.32% cat [kernel.kallsyms] [k] __mem_cgroup_commit_charge
2.18% kswapd0 [kernel.kallsyms] [k] __mem_cgroup_uncharge_common
1.92% kswapd0 [kernel.kallsyms] [k] shrink_page_list
1.86% cat [kernel.kallsyms] [k] __radix_tree_lookup
1.62% cat [kernel.kallsyms] [k] __pagevec_lru_add_fn
After:
15.67% cat [kernel.kallsyms] [k] copy_user_generic_string
13.48% cat [kernel.kallsyms] [k] memset
11.42% cat [kernel.kallsyms] [k] do_mpage_readpage
3.98% cat [kernel.kallsyms] [k] get_page_from_freelist
2.46% cat [kernel.kallsyms] [k] put_page
2.13% kswapd0 [kernel.kallsyms] [k] shrink_page_list
1.88% cat [kernel.kallsyms] [k] __radix_tree_lookup
1.67% cat [kernel.kallsyms] [k] __pagevec_lru_add_fn
1.39% kswapd0 [kernel.kallsyms] [k] free_pcppages_bulk
1.30% cat [kernel.kallsyms] [k] kfree
As you can see, the memcg footprint has shrunk quite a bit.
text data bss dec hex filename
37970 9892 400 48262 bc86 mm/memcontrol.o.old
35239 9892 400 45531 b1db mm/memcontrol.o
This patch (of 13):
This function was split out because mem_cgroup_try_charge() got too big.
But having essentially one sequence of operations arbitrarily split in
half is not good for reworking the code. Fold it back in.
Signed-off-by: Johannes Weiner <[email protected]>
Acked-by: Michal Hocko <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Tejun Heo <[email protected]>
Cc: Vladimir Davydov <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
- PAGEFLAG_FALSE only defines TEST, make it define SET and CLEAR as
well, analogous to PAGEFLAG.
- Define TESTSETFLAG_FALSE, analogous to TESTSETFLAG.
- Define TESTSCFLAG_FALSE, analogous to TESTSCFLAG
- Make PG_mlocked accessors the same on both MMU and !MMU setups
Signed-off-by: Johannes Weiner <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
In some architectures like x86, atomic_add() is a full memory barrier.
In that case, an additional smp_mb() is just a waste of time. This
patch replaces that smp_mb() by smp_mb__after_atomic() which will avoid
the redundant memory barrier in some architectures.
With a 3.16-rc1 based kernel, this patch reduced the execution time of
breaking 1000 transparent huge pages from 38,245us to 30,964us. A
reduction of 19% which is quite sizeable. It also reduces the %cpu time
of the __split_huge_page_refcount function in the perf profile from
2.18% to 1.15%.
Signed-off-by: Waiman Long <[email protected]>
Acked-by: Kirill A. Shutemov <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: Scott J Norton <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
In __split_huge_page_map(), the check for page_mapcount(page) is
invariant within the for loop. Because of the fact that the macro is
implemented using atomic_read(), the redundant check cannot be optimized
away by the compiler leading to unnecessary read to the page structure.
This patch moves the invariant bug check out of the loop so that it will
be done only once. On a 3.16-rc1 based kernel, the execution time of a
microbenchmark that broke up 1000 transparent huge pages using munmap()
had an execution time of 38,245us and 38,548us with and without the
patch respectively. The performance gain is about 1%.
Signed-off-by: Waiman Long <[email protected]>
Acked-by: Kirill A. Shutemov <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: Scott J Norton <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
We don't need explicit 'CMA:' prefix, since we already define prefix
'cma:' in pr_fmt. So remove it.
Signed-off-by: Joonsoo Kim <[email protected]>
Acked-by: Michal Nazarewicz <[email protected]>
Reviewed-by: Zhang Yanfei <[email protected]>
Cc: "Aneesh Kumar K.V" <[email protected]>
Cc: Alexander Graf <[email protected]>
Cc: Aneesh Kumar K.V <[email protected]>
Cc: Gleb Natapov <[email protected]>
Acked-by: Marek Szyprowski <[email protected]>
Tested-by: Marek Szyprowski <[email protected]>
Cc: Minchan Kim <[email protected]>
Cc: Paolo Bonzini <[email protected]>
Cc: Benjamin Herrenschmidt <[email protected]>
Cc: Paul Mackerras <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Conventionally, we put output param to the end of param list and put the
'base' ahead of 'size', but cma_declare_contiguous() doesn't look like
that, so change it.
Additionally, move down cma_areas reference code to the position where
it is really needed.
Signed-off-by: Joonsoo Kim <[email protected]>
Acked-by: Michal Nazarewicz <[email protected]>
Reviewed-by: Aneesh Kumar K.V <[email protected]>
Cc: Alexander Graf <[email protected]>
Cc: Aneesh Kumar K.V <[email protected]>
Cc: Gleb Natapov <[email protected]>
Acked-by: Marek Szyprowski <[email protected]>
Tested-by: Marek Szyprowski <[email protected]>
Cc: Minchan Kim <[email protected]>
Cc: Paolo Bonzini <[email protected]>
Cc: Zhang Yanfei <[email protected]>
Cc: Benjamin Herrenschmidt <[email protected]>
Cc: Paul Mackerras <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
We can remove one call sites for clear_cma_bitmap() if we first call it
before checking error number.
Signed-off-by: Joonsoo Kim <[email protected]>
Acked-by: Minchan Kim <[email protected]>
Reviewed-by: Michal Nazarewicz <[email protected]>
Reviewed-by: Zhang Yanfei <[email protected]>
Reviewed-by: Aneesh Kumar K.V <[email protected]>
Cc: Alexander Graf <[email protected]>
Cc: Aneesh Kumar K.V <[email protected]>
Cc: Gleb Natapov <[email protected]>
Acked-by: Marek Szyprowski <[email protected]>
Tested-by: Marek Szyprowski <[email protected]>
Cc: Paolo Bonzini <[email protected]>
Cc: Benjamin Herrenschmidt <[email protected]>
Cc: Paul Mackerras <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Now, we have general CMA reserved area management framework, so use it
for future maintainabilty. There is no functional change.
Signed-off-by: Joonsoo Kim <[email protected]>
Acked-by: Michal Nazarewicz <[email protected]>
Acked-by: Paolo Bonzini <[email protected]>
Tested-by: Aneesh Kumar K.V <[email protected]>
Cc: Alexander Graf <[email protected]>
Cc: Aneesh Kumar K.V <[email protected]>
Cc: Gleb Natapov <[email protected]>
Acked-by: Marek Szyprowski <[email protected]>
Tested-by: Marek Szyprowski <[email protected]>
Cc: Minchan Kim <[email protected]>
Cc: Zhang Yanfei <[email protected]>
Cc: Benjamin Herrenschmidt <[email protected]>
Cc: Paul Mackerras <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Currently, there are two users on CMA functionality, one is the DMA
subsystem and the other is the KVM on powerpc. They have their own code
to manage CMA reserved area even if they looks really similar. From my
guess, it is caused by some needs on bitmap management. KVM side wants
to maintain bitmap not for 1 page, but for more size. Eventually it use
bitmap where one bit represents 64 pages.
When I implement CMA related patches, I should change those two places
to apply my change and it seem to be painful to me. I want to change
this situation and reduce future code management overhead through this
patch.
This change could also help developer who want to use CMA in their new
feature development, since they can use CMA easily without copying &
pasting this reserved area management code.
In previous patches, we have prepared some features to generalize CMA
reserved area management and now it's time to do it. This patch moves
core functions to mm/cma.c and change DMA APIs to use these functions.
There is no functional change in DMA APIs.
Signed-off-by: Joonsoo Kim <[email protected]>
Acked-by: Michal Nazarewicz <[email protected]>
Acked-by: Zhang Yanfei <[email protected]>
Acked-by: Minchan Kim <[email protected]>
Reviewed-by: Aneesh Kumar K.V <[email protected]>
Cc: Alexander Graf <[email protected]>
Cc: Aneesh Kumar K.V <[email protected]>
Cc: Gleb Natapov <[email protected]>
Acked-by: Marek Szyprowski <[email protected]>
Tested-by: Marek Szyprowski <[email protected]>
Cc: Paolo Bonzini <[email protected]>
Cc: Benjamin Herrenschmidt <[email protected]>
Cc: Paul Mackerras <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
PPC KVM's CMA area management requires arbitrary bitmap granularity,
since they want to reserve very large memory and manage this region with
bitmap that one bit for several pages to reduce management overheads.
So support arbitrary bitmap granularity for following generalization.
[[email protected]: s/1/1UL/]
Signed-off-by: Joonsoo Kim <[email protected]>
Acked-by: Michal Nazarewicz <[email protected]>
Acked-by: Zhang Yanfei <[email protected]>
Acked-by: Minchan Kim <[email protected]>
Reviewed-by: Aneesh Kumar K.V <[email protected]>
Cc: Alexander Graf <[email protected]>
Cc: Aneesh Kumar K.V <[email protected]>
Cc: Gleb Natapov <[email protected]>
Acked-by: Marek Szyprowski <[email protected]>
Tested-by: Marek Szyprowski <[email protected]>
Cc: Paolo Bonzini <[email protected]>
Cc: Benjamin Herrenschmidt <[email protected]>
Cc: Paul Mackerras <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
PPC KVM's CMA area management needs alignment constraint on CMA region.
So support it to prepare generalization of CMA area management
functionality.
Additionally, add some comments which tell us why alignment constraint
is needed on CMA region.
Signed-off-by: Joonsoo Kim <[email protected]>
Acked-by: Michal Nazarewicz <[email protected]>
Reviewed-by: Aneesh Kumar K.V <[email protected]>
Cc: Alexander Graf <[email protected]>
Cc: Aneesh Kumar K.V <[email protected]>
Cc: Gleb Natapov <[email protected]>
Acked-by: Marek Szyprowski <[email protected]>
Tested-by: Marek Szyprowski <[email protected]>
Cc: Minchan Kim <[email protected]>
Cc: Paolo Bonzini <[email protected]>
Cc: Zhang Yanfei <[email protected]>
Cc: Benjamin Herrenschmidt <[email protected]>
Cc: Paul Mackerras <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
To prepare future generalization work on CMA area management code, we
need to separate core CMA management codes from DMA APIs. We will
extend these core functions to cover requirements of PPC KVM's CMA area
management functionality in following patches. This separation helps us
not to touch DMA APIs while extending core functions.
Signed-off-by: Joonsoo Kim <[email protected]>
Acked-by: Michal Nazarewicz <[email protected]>
Reviewed-by: Aneesh Kumar K.V <[email protected]>
Cc: Alexander Graf <[email protected]>
Cc: Aneesh Kumar K.V <[email protected]>
Cc: Gleb Natapov <[email protected]>
Cc: Minchan Kim <[email protected]>
Cc: Paolo Bonzini <[email protected]>
Cc: Zhang Yanfei <[email protected]>
Cc: Benjamin Herrenschmidt <[email protected]>
Cc: Paul Mackerras <[email protected]>
Acked-by: Marek Szyprowski <[email protected]>
Tested-by: Marek Szyprowski <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Use nth_page instead of pfn_to_page(page_to_pfn
Signed-off-by: Fabian Frederick <[email protected]>
Cc: Rik van Riel <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Instead of open-coding getting minimal value of two, just use min macro.
That is why it is there for. While changing the function also change
type of batch local variable to match type of per_cpu_pages::batch
(which is int).
Signed-off-by: Michal Nazarewicz <[email protected]>
Acked-by: David Rientjes <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
In store_mem_state(), we have:
...
334 else if (!strncmp(buf, "offline", min_t(int, count, 7)))
335 online_type = -1;
...
355 case -1:
356 ret = device_offline(&mem->dev);
357 break;
...
Here, "offline" is hard coded as -1.
This patch does the following renaming:
ONLINE_KEEP -> MMOP_ONLINE_KEEP
ONLINE_KERNEL -> MMOP_ONLINE_KERNEL
ONLINE_MOVABLE -> MMOP_ONLINE_MOVABLE
and introduces MMOP_OFFLINE = -1 to avoid hard coding.
Signed-off-by: Tang Chen <[email protected]>
Cc: Hu Tao <[email protected]>
Cc: Greg Kroah-Hartman <[email protected]>
Cc: Lai Jiangshan <[email protected]>
Cc: Yasuaki Ishimatsu <[email protected]>
Cc: Gu Zheng <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
state of memory_block
We use the following command to online a memory_block:
echo online|online_kernel|online_movable > /sys/devices/system/memory/memoryXXX/state
But, if we do the following:
echo online_fhsjkghfkd > /sys/devices/system/memory/memoryXXX/state
the block will also be onlined.
This is because the following code in store_mem_state() does not compare
the whole string, but only the prefix of the string.
store_mem_state()
{
......
328 if (!strncmp(buf, "online_kernel", min_t(int, count, 13)))
Here, only compare the first 13 letters of the string. If we give "online_kernelXXXXXX",
it will be recognized as online_kernel, which is incorrect.
329 online_type = ONLINE_KERNEL;
330 else if (!strncmp(buf, "online_movable", min_t(int, count, 14)))
We have the same problem here,
331 online_type = ONLINE_MOVABLE;
332 else if (!strncmp(buf, "online", min_t(int, count, 6)))
here,
(Here is more problematic. If we give online_movalbe, which is a typo
of online_movable, it will be recognized as online without noticing the
author.)
333 online_type = ONLINE_KEEP;
334 else if (!strncmp(buf, "offline", min_t(int, count, 7)))
and here.
335 online_type = -1;
336 else {
337 ret = -EINVAL;
338 goto err;
339 }
......
}
This patch fixes this problem by using sysfs_streq() to compare the
whole string.
Signed-off-by: Tang Chen <[email protected]>
Reported-by: Hu Tao <[email protected]>
Cc: Greg Kroah-Hartman <[email protected]>
Cc: Lai Jiangshan <[email protected]>
Cc: Yasuaki Ishimatsu <[email protected]>
Cc: Gu Zheng <[email protected]>
Acked-by: Toshi Kani <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Use ACCESS_ONCE() in handle_pte_fault() when getting the entry or
orig_pte upon which all subsequent decisions and pte_same() tests will
be made.
I have no evidence that its lack is responsible for the mm/filemap.c:202
BUG_ON(page_mapped(page)) in __delete_from_page_cache() found by
trinity, and I am not optimistic that it will fix it. But I have found
no other explanation, and ACCESS_ONCE() here will surely not hurt.
If gcc does re-access the pte before passing it down, then that would be
disastrous for correct page fault handling, and certainly could explain
the page_mapped() BUGs seen (concurrent fault causing page to be mapped
in a second time on top of itself: mapcount 2 for a single pte).
Signed-off-by: Hugh Dickins <[email protected]>
Cc: Sasha Levin <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: "Kirill A. Shutemov" <[email protected]>
Cc: Konstantin Khlebnikov <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Richard Yao reported a month ago that his system have a trouble with
vmap_area_lock contention during performance analysis by /proc/meminfo.
Andrew asked why his analysis checks /proc/meminfo stressfully, but he
didn't answer it.
https://lkml.org/lkml/2014/4/10/416
Although I'm not sure that this is right usage or not, there is a
solution reducing vmap_area_lock contention with no side-effect. That
is just to use rcu list iterator in get_vmalloc_info().
rcu can be used in this function because all RCU protocol is already
respected by writers, since Nick Piggin commit db64fe02258f1 ("mm:
rewrite vmap layer") back in linux-2.6.28
Specifically :
insertions use list_add_rcu(),
deletions use list_del_rcu() and kfree_rcu().
Note the rb tree is not used from rcu reader (it would not be safe),
only the vmap_area_list has full RCU protection.
Note that __purge_vmap_area_lazy() already uses this rcu protection.
rcu_read_lock();
list_for_each_entry_rcu(va, &vmap_area_list, list) {
if (va->flags & VM_LAZY_FREE) {
if (va->va_start < *start)
*start = va->va_start;
if (va->va_end > *end)
*end = va->va_end;
nr += (va->va_end - va->va_start) >> PAGE_SHIFT;
list_add_tail(&va->purge_list, &valist);
va->flags |= VM_LAZY_FREEING;
va->flags &= ~VM_LAZY_FREE;
}
}
rcu_read_unlock();
Peter:
: While rcu list traversal over the vmap_area_list is safe, this may
: arrive at different results than the spinlocked version. The rcu list
: traversal version will not be a 'snapshot' of a single, valid instant
: of the entire vmap_area_list, but rather a potential amalgam of
: different list states.
Joonsoo:
: Yes, you are right, but I don't think that we should be strict here.
: Meminfo is already not a 'snapshot' at specific time. While we try to get
: certain stats, the other stats can change. And, although we may arrive at
: different results than the spinlocked version, the difference would not be
: large and would not make serious side-effect.
[[email protected]: add more commit description]
Signed-off-by: Joonsoo Kim <[email protected]>
Reported-by: Richard Yao <[email protected]>
Acked-by: Eric Dumazet <[email protected]>
Cc: Peter Hurley <[email protected]>
Cc: Zhang Yanfei <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Andi Kleen <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
memblock_set_bottom_up() is only called by __init
cmdline_parse_movable_node() and __init numa_init().
Signed-off-by: Fabian Frederick <[email protected]>
Reviewed-by: Tang Chen <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
It is only called by mm/page_cgroup.c whcih cannot be modular.
Reported-by: David Rientjes <[email protected]>
Cc: Fabian Frederick <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
alloc_pages_exact_nid() is only called by __meminit alloc_page_cgroup()
Signed-off-by: Fabian Frederick <[email protected]>
Cc: Mel Gorman <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
grow_zone_span and grow_pgdat_span are only called by
__meminit __add_zone
Signed-off-by: Fabian Frederick <[email protected]>
Cc: Toshi Kani <[email protected]>
Acked-by: David Rientjes <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
count_history_pages does only call page_cache_prev_hole in rcu_lock
context using address_space mapping. There's no need to have
file_ra_state here.
Signed-off-by: Fabian Frederick <[email protected]>
Acked-by: Fengguang Wu <[email protected]>
Acked-by: David Rientjes <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|