Age | Commit message (Collapse) | Author | Files | Lines |
|
Before commit c10d38cc8d3e ("mm, swap: bounds check swap_info array
accesses to avoid NULL derefs"), the typical code to reference the
swap_info[] is as follows,
type = swp_type(swp_entry);
if (type >= nr_swapfiles)
/* handle invalid swp_entry */;
p = swap_info[type];
/* access fields of *p. OOPS! p may be NULL! */
Because the ordering isn't guaranteed, it's possible that swap_info[type]
is read before "nr_swapfiles". And that may result in NULL pointer
dereference.
So after commit c10d38cc8d3e, the code becomes,
struct swap_info_struct *swap_type_to_swap_info(int type)
{
if (type >= READ_ONCE(nr_swapfiles))
return NULL;
smp_rmb();
return READ_ONCE(swap_info[type]);
}
/* users */
type = swp_type(swp_entry);
p = swap_type_to_swap_info(type);
if (!p)
/* handle invalid swp_entry */;
/* dereference p */
Where the value of swap_info[type] (that is, "p") is checked to be
non-zero before being dereferenced. So, the NULL deferencing becomes
impossible even if "nr_swapfiles" is read after swap_info[type].
Therefore, the "smp_rmb()" becomes unnecessary.
And, we don't even need to read "nr_swapfiles" here. Because the non-zero
checking for "p" is sufficient. We just need to make sure we will not
access out of the boundary of the array. With the change, nr_swapfiles
will only be accessed with swap_lock held, except in
swapcache_free_entries(). Where the absolute correctness of the value
isn't needed, as described in the comments.
We still need to guarantee swap_info[type] is read before being
dereferenced. That can be satisfied via the data dependency ordering
enforced by READ_ONCE(swap_info[type]). This needs to be paired with
proper write barriers. So smp_store_release() is used in
alloc_swap_info() to guarantee the fields of *swap_info[type] is
initialized before swap_info[type] itself being written. Note that the
fields of *swap_info[type] is initialized to be 0 via kvzalloc() firstly.
The assignment and deferencing of swap_info[type] is like
rcu_assign_pointer() and rcu_dereference().
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: "Huang, Ying" <[email protected]>
Cc: Daniel Jordan <[email protected]>
Cc: Dan Carpenter <[email protected]>
Cc: Andrea Parri <[email protected]>
Cc: Peter Zijlstra (Intel) <[email protected]>
Cc: Andi Kleen <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Omar Sandoval <[email protected]>
Cc: Paul McKenney <[email protected]>
Cc: Tejun Heo <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: Miaohe Lin <[email protected]>
Cc: Hugh Dickins <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
deactivate_swap_slots_cache() and reactivate_swap_slots_cache() are only
called below their implementations. So these forward declarations are
meaningless and should be removed.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Miaohe Lin <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Matthew Wilcox (Oracle) <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Since commit 55c653b71e8c ("mm: stop accounting shadow entries"),
nr_shadows is not used anymore.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Miaohe Lin <[email protected]>
Reviewed-by: Matthew Wilcox (Oracle) <[email protected]>
Cc: Hugh Dickins <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Patch series "Cleanups for swap", v2.
This series contains just cleanups to remove some unused variables, delete
meaningless forward declarations and so on. More details can be found in
the respective changelogs.
This patch (of 4):
We should move get_swap_page_of_type() under CONFIG_HIBERNATION since the
only caller of this function is now suspend routine.
[[email protected]: move scan_swap_map() under CONFIG_HIBERNATION]
Link: https://lkml.kernel.org/r/[email protected]
[[email protected]: fold scan_swap_map() into the only caller get_swap_page_of_type()]
Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Miaohe Lin <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Matthew Wilcox (Oracle) <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
When I was investigating the swap code, I found the below possible race
window:
CPU 1 CPU 2
----- -----
shmem_swapin
swap_cluster_readahead
if (likely(si->flags & (SWP_BLKDEV | SWP_FS_OPS))) {
swapoff
..
si->swap_file = NULL;
..
struct inode *inode = si->swap_file->f_mapping->host;[oops!]
Close this race window by using get/put_swap_device() to guard against
concurrent swapoff.
Link: https://lkml.kernel.org/r/[email protected]
Fixes: 8fd2e0b505d1 ("mm: swap: check if swap backing device is congested or not")
Signed-off-by: Miaohe Lin <[email protected]>
Reviewed-by: "Huang, Ying" <[email protected]>
Cc: Dennis Zhou <[email protected]>
Cc: Tim Chen <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Joonsoo Kim <[email protected]>
Cc: Alex Shi <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Minchan Kim <[email protected]>
Cc: Wei Yang <[email protected]>
Cc: Yang Shi <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: Yu Zhao <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
The non_swap_entry() was used for working with VMA based swap readahead
via commit ec560175c0b6 ("mm, swap: VMA based swap readahead"). At that
time, the non_swap_entry() checking is necessary because the function is
called before checking that in do_swap_page(). Then it's moved to
swap_ra_info() since commit eaf649ebc3ac ("mm: swap: clean up swap
readahead"). After that, the non_swap_entry() checking is unnecessary,
because swap_ra_info() is called after non_swap_entry() has been checked
already. The resulting code is confusing as the non_swap_entry() check
looks racy now because while we released the pte lock, somebody else might
have faulted in this pte. So we should check whether it's swap pte first
to guard against such race or swap_type will be unexpected. But the race
isn't important because it will not cause problem. We would have enough
checking when we really operate the PTE entries later. So we remove the
non_swap_entry() check here to avoid confusion.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Miaohe Lin <[email protected]>
Reviewed-by: "Huang, Ying" <[email protected]>
Cc: Alex Shi <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: Dennis Zhou <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Joonsoo Kim <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Minchan Kim <[email protected]>
Cc: Tim Chen <[email protected]>
Cc: Wei Yang <[email protected]>
Cc: Yang Shi <[email protected]>
Cc: Yu Zhao <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
When I was investigating the swap code, I found the below possible race
window:
CPU 1 CPU 2
----- -----
do_swap_page
if (data_race(si->flags & SWP_SYNCHRONOUS_IO)
swap_readpage
if (data_race(sis->flags & SWP_FS_OPS)) {
swapoff
..
p->swap_file = NULL;
..
struct file *swap_file = sis->swap_file;
struct address_space *mapping = swap_file->f_mapping;[oops!]
Note that for the pages that are swapped in through swap cache, this isn't
an issue. Because the page is locked, and the swap entry will be marked
with SWAP_HAS_CACHE, so swapoff() can not proceed until the page has been
unlocked.
Fix this race by using get/put_swap_device() to guard against concurrent
swapoff.
Link: https://lkml.kernel.org/r/[email protected]
Fixes: 0bcac06f27d7 ("mm,swap: skip swapcache for swapin of synchronous device")
Signed-off-by: Miaohe Lin <[email protected]>
Reviewed-by: "Huang, Ying" <[email protected]>
Cc: Alex Shi <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: Dennis Zhou <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Joonsoo Kim <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Minchan Kim <[email protected]>
Cc: Tim Chen <[email protected]>
Cc: Wei Yang <[email protected]>
Cc: Yang Shi <[email protected]>
Cc: Yu Zhao <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Patch series "close various race windows for swap", v6.
When I was investigating the swap code, I found some possible race
windows. This series aims to fix all these races. But using current
get/put_swap_device() to guard against concurrent swapoff for
swap_readpage() looks terrible because swap_readpage() may take really
long time. And to reduce the performance overhead on the hot-path as much
as possible, it appears we can use the percpu_ref to close this race
window(as suggested by Huang, Ying). The patch 1 adds percpu_ref support
for swap and most of the remaining patches try to use this to close
various race windows. More details can be found in the respective
changelogs.
This patch (of 4):
Using current get/put_swap_device() to guard against concurrent swapoff
for some swap ops, e.g. swap_readpage(), looks terrible because they
might take really long time. This patch adds the percpu_ref support to
serialize against concurrent swapoff(as suggested by Huang, Ying). Also
we remove the SWP_VALID flag because it's used together with RCU solution.
Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Miaohe Lin <[email protected]>
Reviewed-by: "Huang, Ying" <[email protected]>
Cc: Alex Shi <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: Dennis Zhou <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Joonsoo Kim <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Minchan Kim <[email protected]>
Cc: Tim Chen <[email protected]>
Cc: Wei Yang <[email protected]>
Cc: Yang Shi <[email protected]>
Cc: Yu Zhao <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Pagewalk ignores hugepd entries and walk down the tables as if it was
traditionnal entries, leading to crazy result.
Add walk_hugepd_range() and use it to walk hugepage tables.
Link: https://lkml.kernel.org/r/38d04410700c8d02f28ba37e020b62c55d6f3d2c.1624597695.git.christophe.leroy@csgroup.eu
Signed-off-by: Christophe Leroy <[email protected]>
Reviewed-by: Steven Price <[email protected]>
Cc: Benjamin Herrenschmidt <[email protected]>
Cc: Paul Mackerras <[email protected]>
Cc: Michael Ellerman <[email protected]>
Cc: Daniel Axtens <[email protected]>
Cc: "Oliver O'Halloran" <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
has_pinned 32bit can be packed in the MMF_HAS_PINNED bit as a noop
cleanup.
Any atomic_inc/dec to the mm cacheline shared by all threads in pin-fast
would reintroduce a loss of SMP scalability to pin-fast, so there's no
future potential usefulness to keep an atomic in the mm for this.
set_bit(MMF_HAS_PINNED) will be theoretically a bit slower than WRITE_ONCE
(atomic_set is equivalent to WRITE_ONCE), but the set_bit (just like
atomic_set after this commit) has to be still issued only once per "mm",
so the difference between the two will be lost in the noise.
will-it-scale "mmap2" shows no change in performance with enterprise
config as expected.
will-it-scale "pin_fast" retains the > 4000% SMP scalability performance
improvement against upstream as expected.
This is a noop as far as overall performance and SMP scalability are
concerned.
[[email protected]: pack has_pinned in MMF_HAS_PINNED]
Link: https://lkml.kernel.org/r/YJqWESqyxa8OZA+2@t490s
[[email protected]: coding style fixes]
[[email protected]: fix build for task_mmu.c, introduce mm_set_has_pinned_flag, fix comments]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Andrea Arcangeli <[email protected]>
Signed-off-by: Peter Xu <[email protected]>
Reviewed-by: John Hubbard <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Jan Kara <[email protected]>
Cc: Jann Horn <[email protected]>
Cc: Jason Gunthorpe <[email protected]>
Cc: Kirill Shutemov <[email protected]>
Cc: Kirill Tkhai <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Oleg Nesterov <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
has_pinned cannot be written by each pin-fast or it won't scale in SMP.
This isn't "false sharing" strictly speaking (it's more like "true
non-sharing"), but it creates the same SMP scalability bottleneck of
"false sharing".
To verify the improvement, below test is done on 40 cpus host with
Intel(R) Xeon(R) CPU E5-2630 v4 @ 2.20GHz (must be with
CONFIG_GUP_TEST=y):
$ sudo chrt -f 1 ./gup_test -a -m 512 -j 40
Where we can get (average value for 40 threads):
Old kernel: 477729.97 (+- 3.79%)
New kernel: 89144.65 (+-11.76%)
On a similar condition with 256 cpus, this commits increases the SMP
scalability of pin_user_pages_fast() executed by different threads of the
same process by more than 4000%.
[[email protected]: rewrite commit message, add parentheses against "(A & B)"]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Andrea Arcangeli <[email protected]>
Signed-off-by: Peter Xu <[email protected]>
Reviewed-by: John Hubbard <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Jan Kara <[email protected]>
Cc: Jann Horn <[email protected]>
Cc: Jason Gunthorpe <[email protected]>
Cc: Kirill Shutemov <[email protected]>
Cc: Kirill Tkhai <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Oleg Nesterov <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Use __set_page_dirty_no_writeback() instead. This will set the dirty bit
on the page, which will be used to avoid calling set_page_dirty() in the
future. It will have no effect on actually writing the page back, as the
pages are not on any LRU lists.
[[email protected]: export __set_page_dirty_no_writeback() to modules]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
Cc: Al Viro <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Cc: Dan Williams <[email protected]>
Cc: Greg Kroah-Hartman <[email protected]>
Cc: Jan Kara <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
This is fundamentally the same code, so just call it instead of
duplicating it.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Reviewed-by: Greg Kroah-Hartman <[email protected]>
Cc: Al Viro <[email protected]>
Cc: Dan Williams <[email protected]>
Cc: Jan Kara <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Patch series "Further set_page_dirty cleanups".
Prompted by Christoph's recent patches, here are some more patches to
improve the state of set_page_dirty(). They're all from the folio tree,
so they've been tested to a certain extent.
This patch (of 6):
Nothing in __set_page_dirty() is specific to buffer_head, so move it to
mm/page-writeback.c. That removes the only caller of
account_page_dirtied() outside of page-writeback.c, so make it static.
Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Reviewed-by: Greg Kroah-Hartman <[email protected]>
Cc: Jan Kara <[email protected]>
Cc: Al Viro <[email protected]>
Cc: Dan Williams <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Remove the CONFIG_BLOCK default to __set_page_dirty_buffers and just wire
that method up for the missing instances.
[[email protected]: ecryptfs: add a ->set_page_dirty cludge]
Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Christoph Hellwig <[email protected]>
Reviewed-by: Greg Kroah-Hartman <[email protected]>
Reviewed-by: Jan Kara <[email protected]>
Cc: Al Viro <[email protected]>
Cc: Matthew Wilcox (Oracle) <[email protected]>
Cc: Tyler Hicks <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Asynchronously try to release dying cgwbs by switching attached inodes to
the nearest living ancestor wb. It helps to get rid of per-cgroup
writeback structures themselves and of pinned memory and block cgroups,
which are significantly larger structures (mostly due to large per-cpu
statistics data). This prevents memory waste and helps to avoid different
scalability problems caused by large piles of dying cgroups.
Reuse the existing mechanism of inode switching used for foreign inode
detection. To speed things up batch up to 115 inode switching in a single
operation (the maximum number is selected so that the resulting struct
inode_switch_wbs_context can fit into 1024 bytes). Because every
switching consists of two steps divided by an RCU grace period, it would
be too slow without batching. Please note that the whole batch counts as
a single operation (when increasing/decreasing isw_nr_in_flight). This
allows to keep umounting working (flush the switching queue), however
prevents cleanups from consuming the whole switching quota and effectively
blocking the frn switching.
A cgwb cleanup operation can fail due to different reasons (e.g. not
enough memory, the cgwb has an in-flight/pending io, an attached inode in
a wrong state, etc). In this case the next scheduled cleanup will make a
new attempt. An attempt is made each time a new cgwb is offlined (in
other words a memcg and/or a blkcg is deleted by a user). In the future
an additional attempt scheduled by a timer can be implemented.
[[email protected]: replace open-coded "115" with arithmetic]
Link: https://lkml.kernel.org/r/YMEcSBcq/[email protected]
[[email protected]: add smp_mb() to inode_prepare_wbs_switch()]
Link: https://lkml.kernel.org/r/[email protected]
[[email protected]: fix documentation]
Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Roman Gushchin <[email protected]>
Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
Acked-by: Tejun Heo <[email protected]>
Acked-by: Dennis Zhou <[email protected]>
Reviewed-by: Jan Kara <[email protected]>
Cc: Alexander Viro <[email protected]>
Cc: Dave Chinner <[email protected]>
Cc: Jan Kara <[email protected]>
Cc: Jens Axboe <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Currently there is no way to iterate over inodes attached to a specific
cgwb structure. It limits the ability to efficiently reclaim the
writeback structure itself and associated memory and block cgroup
structures without scanning all inodes belonging to a sb, which can be
prohibitively expensive.
While dirty/in-active-writeback an inode belongs to one of the
bdi_writeback's io lists: b_dirty, b_io, b_more_io and b_dirty_time. Once
cleaned up, it's removed from all io lists. So the inode->i_io_list can
be reused to maintain the list of inodes, attached to a bdi_writeback
structure.
This patch introduces a new wb->b_attached list, which contains all inodes
which were dirty at least once and are attached to the given cgwb. Inodes
attached to the root bdi_writeback structures are never placed on such
list. The following patch will use this list to try to release cgwbs
structures more efficiently.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Roman Gushchin <[email protected]>
Suggested-by: Jan Kara <[email protected]>
Reviewed-by: Jan Kara <[email protected]>
Acked-by: Tejun Heo <[email protected]>
Acked-by: Dennis Zhou <[email protected]>
Cc: Alexander Viro <[email protected]>
Cc: Dave Chinner <[email protected]>
Cc: Jan Kara <[email protected]>
Cc: Jens Axboe <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
As account_page_dirtied() was always protected by xa_lock_irqsave(), so
using __this_cpu_inc() is better.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Chi Wu <[email protected]>
Reviewed-by: Jan Kara <[email protected]>
Cc: Howard Cochran <[email protected]>
Cc: Miklos Szeredi <[email protected]>
Cc: Sedat Dilek <[email protected]>
Cc: Tejun Heo <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
As the value of pos_ratio_polynom() clamp between 0 and 2LL <<
RATELIMIT_CALC_SHIFT, the global control line should be consistent with
it.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Chi Wu <[email protected]>
Reviewed-by: Jan Kara <[email protected]>
Cc: Jens Axboe <[email protected]>
Cc: Howard Cochran <[email protected]>
Cc: Miklos Szeredi <[email protected]>
Cc: Sedat Dilek <[email protected]>
Cc: Tejun Heo <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Fix performance when BDI's share of ratio is 0.
The issue is similar to commit 74d369443325 ("writeback: Fix
performance regression in wb_over_bg_thresh()").
Balance_dirty_pages and the writeback worker will also disagree on
whether writeback when a BDI uses BDI_CAP_STRICTLIMIT and BDI's share
of the thresh ratio is zero.
For example, A thread on cpu0 writes 32 pages and then
balance_dirty_pages, it will wake up background writeback and pauses
because wb_dirty > wb->wb_thresh = 0 (share of thresh ratio is zero).
A thread may runs on cpu0 again because scheduler prefers pre_cpu.
Then writeback worker may runs on other cpus(1,2..) which causes the
value of wb_stat(wb, WB_RECLAIMABLE) in wb_over_bg_thresh is 0 and does
not writeback and returns.
Thus, balance_dirty_pages keeps looping, sleeping and then waking up the
worker who will do nothing. It remains stuck in this state until the
writeback worker hit the right dirty cpu or the dirty pages expire.
The fix that we should get the wb_stat_sum radically when thresh is low.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Chi Wu <[email protected]>
Reviewed-by: Jan Kara <[email protected]>
Cc: Tejun Heo <[email protected]>
Cc: Miklos Szeredi <[email protected]>
Cc: Sedat Dilek <[email protected]>
Cc: Jens Axboe <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
The get_writeback_state() has gone since 2006, kill related comments.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Kefeng Wang <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
The page reporting order (threshold) is sticky to @pageblock_order by
default. The page reporting can never be triggered because the freeing
page can't come up with a free area like that huge. The situation becomes
worse when the system memory becomes heavily fragmented.
For example, the following configurations are used on ARM64 when 64KB base
page size is enabled. In this specific case, the page reporting won't be
triggered until the freeing page comes up with a 512MB free area. That's
hard to be met, especially when the system memory becomes heavily
fragmented.
PAGE_SIZE: 64KB
HPAGE_SIZE: 512MB
pageblock_order: 13 (512MB)
MAX_ORDER: 14
This allows the drivers to specify the page reporting order when the page
reporting device is registered. It falls back to @pageblock_order if it's
not specified by the driver. The existing users (hv_balloon and
virtio_balloon) don't specify it and @pageblock_order is still taken as
their page reporting order. So this shouldn't introduce any functional
changes.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Gavin Shan <[email protected]>
Reviewed-by: Alexander Duyck <[email protected]>
Cc: Anshuman Khandual <[email protected]>
Cc: Catalin Marinas <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: "Michael S. Tsirkin" <[email protected]>
Cc: Will Deacon <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
The macro PAGE_REPORTING_MIN_ORDER is defined as the page reporting
threshold. It can't be adjusted at runtime.
This introduces a variable (@page_reporting_order) to replace the marcro
(PAGE_REPORTING_MIN_ORDER). MAX_ORDER is assigned to it initially,
meaning the page reporting is disabled. It will be specified by driver if
valid one is provided. Otherwise, it will fall back to @pageblock_order.
It's also exported so that the page reporting order can be adjusted at
runtime.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Gavin Shan <[email protected]>
Suggested-by: David Hildenbrand <[email protected]>
Reviewed-by: Alexander Duyck <[email protected]>
Cc: Anshuman Khandual <[email protected]>
Cc: Catalin Marinas <[email protected]>
Cc: "Michael S. Tsirkin" <[email protected]>
Cc: Will Deacon <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Patch series "mm/page_reporting: Make page reporting work on arm64 with 64KB page size", v4.
The page reporting threshold is currently equal to @pageblock_order, which
is 13 and 512MB on arm64 with 64KB base page size selected. The page
reporting won't be triggered if the freeing page can't come up with a free
area like that huge. The condition is hard to be met, especially when the
system memory becomes fragmented.
This series intends to solve the issue by having page reporting threshold
as 5 (2MB) on arm64 with 64KB base page size. The patches are organized
as:
PATCH[1/4] Fix some coding style in __page_reporting_request().
PATCH[2/4] Represents page reporting order with variable so that it can
be exported as module parameter.
PATCH[3/4] Allows the device driver (e.g. virtio_balloon) to specify
the page reporting order when the device info is registered.
PATCH[4/4] Specifies the page reporting order to 5, corresponding to
2MB in size on ARM64 when 64KB base page size is used.
This patch (of 4):
The lines of comments would be starting with one, instead two space. This
corrects the style.
Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Gavin Shan <[email protected]>
Reviewed-by: Alexander Duyck <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: "Michael S. Tsirkin" <[email protected]>
Cc: Anshuman Khandual <[email protected]>
Cc: Catalin Marinas <[email protected]>
Cc: Will Deacon <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
mmap_lock will explicitly disable/enable preemption upon manipulating its
local CPU variables. This is to be expected, but in this case, it doesn't
play well with PREEMPT_RT. The preemption disabled code section also
takes a spin-lock. Spin-locks in RT systems will try to schedule, which
is exactly what we're trying to avoid.
To mitigate this, convert the explicit preemption handling to local_locks.
Which are RT aware, and will disable migration instead of preemption when
PREEMPT_RT=y.
The faulty call trace looks like the following:
__mmap_lock_do_trace_*()
preempt_disable()
get_mm_memcg_path()
cgroup_path()
kernfs_path_from_node()
spin_lock_irqsave() /* Scheduling while atomic! */
Link: https://lkml.kernel.org/r/[email protected]
Fixes: 2b5067a8143e3 ("mm: mmap_lock: add tracepoints around lock acquisition ")
Signed-off-by: Nicolas Saenz Julienne <[email protected]>
Tested-by: Axel Rasmussen <[email protected]>
Reviewed-by: Axel Rasmussen <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: Steven Rostedt <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
On certain platforms, THP support could not just be validated via the
build option CONFIG_TRANSPARENT_HUGEPAGE. Instead
has_transparent_hugepage() also needs to be called upon to verify THP
runtime support. Otherwise the debug test will just run into unusable THP
helpers like in the case of a 4K hash config on powerpc platform [1].
This just moves all pfn_pmd() and pfn_pud() after THP runtime validation
with has_transparent_hugepage() which prevents the mentioned problem.
[1] https://bugzilla.kernel.org/show_bug.cgi?id=213069
Link: https://lkml.kernel.org/r/[email protected]
Fixes: 787d563b8642 ("mm/debug_vm_pgtable: fix kernel crash by checking for THP support")
Signed-off-by: Anshuman Khandual <[email protected]>
Cc: Aneesh Kumar K.V <[email protected]>
Cc: Christophe Leroy <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
This commit contains 3 modifications:
1. Convert the type of jiffies_scan_wait to "unsigned long".
2. Use READ/WRITE_ONCE() for accessing "jiffies_scan_wait".
3. Fix the possible wrong memory scanning period. If you set a large
memory scanning period like blow, then the "secs" variable will be
non-zero, however the value of "jiffies_scan_wait" will be zero.
echo "scan=0x10000000" > /sys/kernel/debug/kmemleak
It is because the type of the msecs_to_jiffies()'s parameter is "unsigned
int", and the "secs * 1000" is larger than its max value. This in turn
leads a unexpected jiffies_scan_wait, maybe zero. We corret it by
replacing kstrtoul() with kstrtouint(), and check the msecs to prevent it
larger than UINT_MAX.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Yanfei Xu <[email protected]>
Acked-by: Catalin Marinas <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
When running the kernel with panic_on_taint, the usual slub debug error
messages are not being printed when object corruption happens. That's
because we panic in add_taint(), which is called before printing the
additional information. This is a bit unfortunate as the error messages
are actually very useful, especially before a panic. Let's fix this by
moving add_taint() after the errors are printed on the console.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Georgi Djakov <[email protected]>
Acked-by: Rafael Aquini <[email protected]>
Acked-by: David Rientjes <[email protected]>
Acked-by: Vlastimil Babka <[email protected]>
Reviewed-by: Aaron Tomlin <[email protected]>
Cc: Christoph Lameter <[email protected]>
Cc: Pekka Enberg <[email protected]>
Cc: Joonsoo Kim <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
alloc_calls and free_calls implementation in sysfs have two issues, one is
PAGE_SIZE limitation of sysfs and other is it does not adhere to "one
value per file" rule.
To overcome this issues, move the alloc_calls and free_calls
implementation to debugfs.
Debugfs cache will be created if SLAB_STORE_USER flag is set.
Rename the alloc_calls/free_calls to alloc_traces/free_traces, to be
inline with what it does.
[[email protected]: fix the leak of alloc/free traces debugfs interface]
Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Faiyaz Mohammed <[email protected]>
Reviewed-by: Vlastimil Babka <[email protected]>
Reviewed-by: Greg Kroah-Hartman <[email protected]>
Cc: Christoph Lameter <[email protected]>
Cc: Pekka Enberg <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: Joonsoo Kim <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Obscuring the pointers that slub shows when debugging makes for some
confusing slub debug messages:
Padding overwritten. 0x0000000079f0674a-0x000000000d4dce17
Those addresses are hashed for kernel security reasons. If we're trying
to be secure with slub_debug on the commandline we have some big problems
given that we dump whole chunks of kernel memory to the kernel logs.
Let's force on the no_hash_pointers commandline flag when slub_debug is on
the commandline. This makes slub debug messages more meaningful and if by
chance a kernel address is in some slub debug object dump we will have a
better chance of figuring out what went wrong.
Note that we don't use %px in the slub code because we want to reduce the
number of places that %px is used in the kernel. This also nicely prints
a big fat warning at kernel boot if slub_debug is on the commandline so
that we know that this kernel shouldn't be used on production systems.
[[email protected]: fix build with CONFIG_SLUB_DEBUG=n]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Stephen Boyd <[email protected]>
Acked-by: Vlastimil Babka <[email protected]>
Acked-by: Petr Mladek <[email protected]>
Cc: Joe Perches <[email protected]>
Cc: Christoph Lameter <[email protected]>
Cc: Pekka Enberg <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: Joonsoo Kim <[email protected]>
Cc: Muchun Song <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Ideally, slab_fix() would be marked with __printf and the format here
would not use \n as that's emitted by the slab_fix(). Make these changes.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Joe Perches <[email protected]>
Acked-by: Vlastimil Babka <[email protected]>
Signed-off-by: Stephen Boyd <[email protected]>
Acked-by: David Rientjes <[email protected]>
Cc: Christoph Lameter <[email protected]>
Cc: Joonsoo Kim <[email protected]>
Cc: Muchun Song <[email protected]>
Cc: Pekka Enberg <[email protected]>
Cc: Petr Mladek <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
The message argument isn't used here. Let's pass the string to the printk
message so that the developer can figure out what's happening, instead of
guessing that a redzone is being restored, etc.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Stephen Boyd <[email protected]>
Reviewed-by: Vlastimil Babka <[email protected]>
Acked-by: David Rientjes <[email protected]>
Reviewed-by: Muchun Song <[email protected]>
Cc: Christoph Lameter <[email protected]>
Cc: Joe Perches <[email protected]>
Cc: Joonsoo Kim <[email protected]>
Cc: Pekka Enberg <[email protected]>
Cc: Petr Mladek <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Petch series "slub: Print non-hashed pointers in slub debugging", v3.
I was doing some debugging recently and noticed that my pointers were
being hashed while slub_debug was on the kernel commandline. Let's force
on the no hash pointer option when slub_debug is on the kernel commandline
so that the prints are more meaningful.
The first two patches are something else I noticed while looking at the
code. The message argument is never used so the debugging messages are
not as clear as they could be and the slub_debug=- behavior seems to be
busted. Then there's a printf fixup from Joe and the final patch is the
one that force disables pointer hashing.
This patch (of 4):
Passing slub_debug=- on the kernel commandline is supposed to disable slub
debugging. This is especially useful with CONFIG_SLUB_DEBUG_ON where the
default is to have slub debugging enabled in the build. Due to some code
reorganization this behavior was dropped, but the code to make it work
mostly stuck around. Restore the previous behavior by disabling the
static key when we parse the commandline and see that we're trying to
disable slub debugging.
Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Fixes: ca0cab65ea2b ("mm, slub: introduce static key for slub_debug()")
Signed-off-by: Stephen Boyd <[email protected]>
Acked-by: Vlastimil Babka <[email protected]>
Reviewed-by: Muchun Song <[email protected]>
Cc: Christoph Lameter <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: Joe Perches <[email protected]>
Cc: Joonsoo Kim <[email protected]>
Cc: Pekka Enberg <[email protected]>
Cc: Petr Mladek <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Currently when size is not supported by kmalloc_index, compiler will
generate a run-time BUG() while compile-time error is also possible, and
better. So change BUG to BUILD_BUG_ON_MSG to make compile-time check
possible.
Also remove code that allocates more than 32MB because current
implementation supports only up to 32MB.
[[email protected]: fix support for clang 10]
Link: https://lkml.kernel.org/r/20210518181247.GA10062@hyeyoo
[[email protected]: fix false-positive assert in kernel/bpf/local_storage.c]
Link: https://lkml.kernel.org/r/[email protected]: https://lkml.kernel.org/r/20210511173448.GA54466@hyeyoo
[[email protected]: kfence fix]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Hyeonggon Yoo <[email protected]>
Signed-off-by: Vlastimil Babka <[email protected]>
Reviewed-by: Vlastimil Babka <[email protected]>
Signed-off-by: Marco Elver <[email protected]>
Cc: Christoph Lameter <[email protected]>
Cc: Pekka Enberg <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: Joonsoo Kim <[email protected]>
Cc: Marco Elver <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Function resiliency_test() is hidden behind #ifdef SLUB_RESILIENCY_TEST
that is not part of Kconfig, so nobody runs it.
This function is replaced with KUnit test for SLUB added by the previous
patch "selftests: add a KUnit test for SLUB debugging functionality".
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Oliver Glitta <[email protected]>
Reviewed-by: Marco Elver <[email protected]>
Acked-by: Vlastimil Babka <[email protected]>
Acked-by: David Rientjes <[email protected]>
Cc: Christoph Lameter <[email protected]>
Cc: Pekka Enberg <[email protected]>
Cc: Joonsoo Kim <[email protected]>
Cc: Oliver Glitta <[email protected]>
Cc: Brendan Higgins <[email protected]>
Cc: Daniel Latypov <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
SLUB has resiliency_test() function which is hidden behind #ifdef
SLUB_RESILIENCY_TEST that is not part of Kconfig, so nobody runs it.
KUnit should be a proper replacement for it.
Try changing byte in redzone after allocation and changing pointer to next
free node, first byte, 50th byte and redzone byte. Check if validation
finds errors.
There are several differences from the original resiliency test: Tests
create own caches with known state instead of corrupting shared kmalloc
caches.
The corruption of freepointer uses correct offset, the original resiliency
test got broken with freepointer changes.
Scratch changing random byte test, because it does not have meaning in
this form where we need deterministic results.
Add new option CONFIG_SLUB_KUNIT_TEST in Kconfig. Tests next_pointer,
first_word and clobber_50th_byte do not run with KASAN option on. Because
the test deliberately modifies non-allocated objects.
Use kunit_resource to count errors in cache and silence bug reports.
Count error whenever slab_bug() or slab_fix() is called or when the count
of pages is wrong.
[[email protected]: remove unused function test_exit(), from SLUB KUnit test]
Link: https://lkml.kernel.org/r/[email protected]
[[email protected]: export kasan_enable/disable_current to modules]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Oliver Glitta <[email protected]>
Reviewed-by: Vlastimil Babka <[email protected]>
Acked-by: Daniel Latypov <[email protected]>
Acked-by: Marco Elver <[email protected]>
Cc: Brendan Higgins <[email protected]>
Cc: Christoph Lameter <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: Joonsoo Kim <[email protected]>
Cc: Pekka Enberg <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
It is better to use __func__ to trace function name.
Link: https://lkml.kernel.org/r/31fdbad5c45cd1e26be9ff37be321b8586b80fee.1624355507.git.gumingtao@xiaomi.com
Signed-off-by: gumingtao <[email protected]>
Acked-by: Christoph Lameter <[email protected]>
Acked-by: David Rientjes <[email protected]>
Reviewed-by: Aaron Tomlin <[email protected]>
Cc: Pekka Enberg <[email protected]>
Cc: Joonsoo Kim <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
populated
Dave Jones reported the following
This made it into 5.13 final, and completely breaks NFSD for me
(Serving tcp v3 mounts). Existing mounts on clients hang, as do
new mounts from new clients. Rebooting the server back to rc7
everything recovers.
The commit b3b64ebd3822 ("mm/page_alloc: do bulk array bounds check after
checking populated elements") returns the wrong value if the array is
already populated which is interpreted as an allocation failure. Dave
reported this fixes his problem and it also passed a test running dbench
over NFS.
Link: https://lkml.kernel.org/r/[email protected]
Fixes: b3b64ebd3822 ("mm/page_alloc: do bulk array bounds check after checking populated elements")
Signed-off-by: Mel Gorman <[email protected]>
Reported-by: Dave Jones <[email protected]>
Tested-by: Dave Jones <[email protected]>
Cc: Dan Carpenter <[email protected]>
Cc: Jesper Dangaard Brouer <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: <[email protected]> [5.13+]
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
On systems with memory nodes sorted in descending order, for instance Dell
Precision WorkStation T5500, the struct pages for higher PFNs and
respectively lower nodes, could be overwritten by the initialization of
struct pages corresponding to the holes in the memory sections.
For example for the below memory layout
[ 0.245624] Early memory node ranges
[ 0.248496] node 1: [mem 0x0000000000001000-0x0000000000090fff]
[ 0.251376] node 1: [mem 0x0000000000100000-0x00000000dbdf8fff]
[ 0.254256] node 1: [mem 0x0000000100000000-0x0000001423ffffff]
[ 0.257144] node 0: [mem 0x0000001424000000-0x0000002023ffffff]
the range 0x1424000000 - 0x1428000000 in the beginning of node 0 starts in
the middle of a section and will be considered as a hole during the
initialization of the last section in node 1.
The wrong initialization of the memory map causes panic on boot when
CONFIG_DEBUG_VM is enabled.
Reorder loop order of the memory map initialization so that the outer loop
will always iterate over populated memory regions in the ascending order
and the inner loop will select the zone corresponding to the PFN range.
This way initialization of the struct pages for the memory holes will be
always done for the ranges that are actually not populated.
[[email protected]: coding style fixes]
Link: https://lkml.kernel.org/r/[email protected]
Link: https://bugzilla.kernel.org/show_bug.cgi?id=213073
Link: https://lkml.kernel.org/r/[email protected]
Fixes: 0740a50b9baa ("mm/page_alloc.c: refactor initialization of struct page for holes in memory layout")
Signed-off-by: Mike Rapoport <[email protected]>
Cc: Boris Petkov <[email protected]>
Cc: Robert Shteynfeld <[email protected]>
Cc: Baoquan He <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
try_grab_compound_head() is used to grab a reference to a page from
get_user_pages_fast(), which is only protected against concurrent freeing
of page tables (via local_irq_save()), but not against concurrent TLB
flushes, freeing of data pages, or splitting of compound pages.
Because no reference is held to the page when try_grab_compound_head() is
called, the page may have been freed and reallocated by the time its
refcount has been elevated; therefore, once we're holding a stable
reference to the page, the caller re-checks whether the PTE still points
to the same page (with the same access rights).
The problem is that try_grab_compound_head() has to grab a reference on
the head page; but between the time we look up what the head page is and
the time we actually grab a reference on the head page, the compound page
may have been split up (either explicitly through split_huge_page() or by
freeing the compound page to the buddy allocator and then allocating its
individual order-0 pages). If that happens, get_user_pages_fast() may end
up returning the right page but lifting the refcount on a now-unrelated
page, leading to use-after-free of pages.
To fix it: Re-check whether the pages still belong together after lifting
the refcount on the head page. Move anything else that checks
compound_head(page) below the refcount increment.
This can't actually happen on bare-metal x86 (because there, disabling
IRQs locks out remote TLB flushes), but it can happen on virtualized x86
(e.g. under KVM) and probably also on arm64. The race window is pretty
narrow, and constantly allocating and shattering hugepages isn't exactly
fast; for now I've only managed to reproduce this in an x86 KVM guest with
an artificially widened timing window (by adding a loop that repeatedly
calls `inl(0x3f8 + 5)` in `try_get_compound_head()` to force VM exits, so
that PV TLB flushes are used instead of IPIs).
As requested on the list, also replace the existing VM_BUG_ON_PAGE() with
a warning and bailout. Since the existing code only performed the BUG_ON
check on DEBUG_VM kernels, ensure that the new code also only performs the
check under that configuration - I don't want to mix two logically
separate changes together too much. The macro VM_WARN_ON_ONCE_PAGE()
doesn't return a value on !DEBUG_VM, so wrap the whole check in an #ifdef
block. An alternative would be to change the VM_WARN_ON_ONCE_PAGE()
definition for !DEBUG_VM such that it always returns false, but since that
would differ from the behavior of the normal WARN macros, it might be too
confusing for readers.
Link: https://lkml.kernel.org/r/[email protected]
Fixes: 7aef4172c795 ("mm: handle PTE-mapped tail pages in gerneric fast gup implementaiton")
Signed-off-by: Jann Horn <[email protected]>
Reviewed-by: John Hubbard <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Kirill A. Shutemov <[email protected]>
Cc: Jan Kara <[email protected]>
Cc: <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Dan Carpenter reported the following
The patch 0f87d9d30f21: "mm/page_alloc: add an array-based interface
to the bulk page allocator" from Apr 29, 2021, leads to the following
static checker warning:
mm/page_alloc.c:5338 __alloc_pages_bulk()
warn: potentially one past the end of array 'page_array[nr_populated]'
The problem can occur if an array is passed in that is fully populated.
That potentially ends up allocating a single page and storing it past
the end of the array. This patch returns 0 if the array is fully
populated.
Link: https://lkml.kernel.org/r/[email protected]
Fixes: 0f87d9d30f21 ("mm/page_alloc: add an array-based interface to the bulk page allocator")
Signed-off-by: Mel Gorman <[email protected]>
Reported-by: Dan Carpenter <[email protected]>
Cc: Jesper Dangaard Brouer <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
In the event that somebody would call this with an already fully
populated page_array, the last loop iteration would do an access beyond
the end of page_array.
It's of course extremely unlikely that would ever be done, but this
triggers my internal static analyzer. Also, if it really is not
supposed to be invoked this way (i.e., with no NULL entries in
page_array), the nr_populated<nr_pages check could simply be removed
instead.
Link: https://lkml.kernel.org/r/[email protected]
Fixes: 0f87d9d30f21 ("mm/page_alloc: add an array-based interface to the bulk page allocator")
Signed-off-by: Rasmus Villemoes <[email protected]>
Acked-by: Mel Gorman <[email protected]>
Reviewed-by: Andrew Morton <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Currently me_huge_page() temporary unlocks page to perform some actions
then locks it again later. My testcase (which calls hard-offline on
some tail page in a hugetlb, then accesses the address of the hugetlb
range) showed that page allocation code detects this page lock on buddy
page and printed out "BUG: Bad page state" message.
check_new_page_bad() does not consider a page with __PG_HWPOISON as bad
page, so this flag works as kind of filter, but this filtering doesn't
work in this case because the "bad page" is not the actual hwpoisoned
page. So stop locking page again. Actions to be taken depend on the
page type of the error, so page unlocking should be done in ->action()
callbacks. So let's make it assumed and change all existing callbacks
that way.
Link: https://lkml.kernel.org/r/[email protected]
Fixes: commit 78bb920344b8 ("mm: hwpoison: dissolve in-use hugepage in unrecoverable memory error")
Signed-off-by: Naoya Horiguchi <[email protected]>
Cc: Oscar Salvador <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Tony Luck <[email protected]>
Cc: "Aneesh Kumar K.V" <[email protected]>
Cc: <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
When memory_failure() is called with MF_ACTION_REQUIRED on the page that
has already been hwpoisoned, memory_failure() could fail to send SIGBUS
to the affected process, which results in infinite loop of MCEs.
Currently memory_failure() returns 0 if it's called for already
hwpoisoned page, then the caller, kill_me_maybe(), could return without
sending SIGBUS to current process. An action required MCE is raised
when the current process accesses to the broken memory, so no SIGBUS
means that the current process continues to run and access to the error
page again soon, so running into MCE loop.
This issue can arise for example in the following scenarios:
- Two or more threads access to the poisoned page concurrently. If
local MCE is enabled, MCE handler independently handles the MCE
events. So there's a race among MCE events, and the second or latter
threads fall into the situation in question.
- If there was a precedent memory error event and memory_failure() for
the event failed to unmap the error page for some reason, the
subsequent memory access to the error page triggers the MCE loop
situation.
To fix the issue, make memory_failure() return an error code when the
error page has already been hwpoisoned. This allows memory error
handler to control how it sends signals to userspace. And make sure
that any process touching a hwpoisoned page should get a SIGBUS even in
"already hwpoisoned" path of memory_failure() as is done in page fault
path.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Aili Yao <[email protected]>
Signed-off-by: Naoya Horiguchi <[email protected]>
Reviewed-by: Oscar Salvador <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: Jue Wang <[email protected]>
Cc: Tony Luck <[email protected]>
Cc: <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Patch series "mm,hwpoison: fix sending SIGBUS for Action Required MCE", v5.
I wrote this patchset to materialize what I think is the current
allowable solution mentioned by the previous discussion [1]. I simply
borrowed Tony's mutex patch and Aili's return code patch, then I queued
another one to find error virtual address in the best effort manner. I
know that this is not a perfect solution, but should work for some
typical case.
[1]: https://lore.kernel.org/linux-mm/20210331192540.2141052f@alex-virtual-machine/
This patch (of 2):
There can be races when multiple CPUs consume poison from the same page.
The first into memory_failure() atomically sets the HWPoison page flag
and begins hunting for tasks that map this page. Eventually it
invalidates those mappings and may send a SIGBUS to the affected tasks.
But while all that work is going on, other CPUs see a "success" return
code from memory_failure() and so they believe the error has been
handled and continue executing.
Fix by wrapping most of the internal parts of memory_failure() in a
mutex.
[[email protected]: make mf_mutex local to memory_failure()]
Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Tony Luck <[email protected]>
Signed-off-by: Naoya Horiguchi <[email protected]>
Reviewed-by: Borislav Petkov <[email protected]>
Reviewed-by: Oscar Salvador <[email protected]>
Cc: Aili Yao <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: Jue Wang <[email protected]>
Cc: <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
If more than one futex is placed on a shmem huge page, it can happen
that waking the second wakes the first instead, and leaves the second
waiting: the key's shared.pgoff is wrong.
When 3.11 commit 13d60f4b6ab5 ("futex: Take hugepages into account when
generating futex_key"), the only shared huge pages came from hugetlbfs,
and the code added to deal with its exceptional page->index was put into
hugetlb source. Then that was missed when 4.8 added shmem huge pages.
page_to_pgoff() is what others use for this nowadays: except that, as
currently written, it gives the right answer on hugetlbfs head, but
nonsense on hugetlbfs tails. Fix that by calling hugetlbfs-specific
hugetlb_basepage_index() on PageHuge tails as well as on head.
Yes, it's unconventional to declare hugetlb_basepage_index() there in
pagemap.h, rather than in hugetlb.h; but I do not expect anything but
page_to_pgoff() ever to need it.
[[email protected]: give hugetlb_basepage_index() prototype the correct scope]
Link: https://lkml.kernel.org/r/[email protected]
Fixes: 800d8c63b2e9 ("shmem: add huge pages support")
Reported-by: Neel Natu <[email protected]>
Signed-off-by: Hugh Dickins <[email protected]>
Reviewed-by: Matthew Wilcox (Oracle) <[email protected]>
Acked-by: Thomas Gleixner <[email protected]>
Cc: "Kirill A. Shutemov" <[email protected]>
Cc: Zhang Yi <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Mike Kravetz <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Darren Hart <[email protected]>
Cc: Davidlohr Bueso <[email protected]>
Cc: <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
In commit 121e6f3258fe ("mm/vmalloc: hugepage vmalloc mappings"),
__vmalloc_node_range was changed such that __get_vm_area_node was no
longer called with the requested/real size of the vmalloc allocation,
but rather with a rounded-up size.
This means that __get_vm_area_node called kasan_unpoision_vmalloc() with
a rounded up size rather than the real size. This led to it allowing
access to too much memory and so missing vmalloc OOBs and failing the
kasan kunit tests.
Pass the real size and the desired shift into __get_vm_area_node. This
allows it to round up the size for the underlying allocators while still
unpoisioning the correct quantity of shadow memory.
Adjust the other call-sites to pass in PAGE_SHIFT for the shift value.
Link: https://lkml.kernel.org/r/[email protected]
Link: https://bugzilla.kernel.org/show_bug.cgi?id=213335
Fixes: 121e6f3258fe ("mm/vmalloc: hugepage vmalloc mappings")
Signed-off-by: Daniel Axtens <[email protected]>
Tested-by: David Gow <[email protected]>
Reviewed-by: Nicholas Piggin <[email protected]>
Reviewed-by: Uladzislau Rezki (Sony) <[email protected]>
Tested-by: Andrey Konovalov <[email protected]>
Acked-by: Andrey Konovalov <[email protected]>
Cc: Dmitry Vyukov <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Patch series "mm: add vmalloc_no_huge and use it", v4.
Add vmalloc_no_huge() and export it, so modules can allocate memory with
small pages.
Use the newly added vmalloc_no_huge() in KVM on s390 to get around a
hardware limitation.
This patch (of 2):
Commit 121e6f3258fe3 ("mm/vmalloc: hugepage vmalloc mappings") added
support for hugepage vmalloc mappings, it also added the flag
VM_NO_HUGE_VMAP for __vmalloc_node_range to request the allocation to be
performed with 0-order non-huge pages.
This flag is not accessible when calling vmalloc, the only option is to
call directly __vmalloc_node_range, which is not exported.
This means that a module can't vmalloc memory with small pages.
Case in point: KVM on s390x needs to vmalloc a large area, and it needs
to be mapped with non-huge pages, because of a hardware limitation.
This patch adds the function vmalloc_no_huge, which works like vmalloc,
but it is guaranteed to always back the mapping using small pages. This
new function is exported, therefore it is usable by modules.
[[email protected]: whitespace fixes, per Christoph]
Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Fixes: 121e6f3258fe3 ("mm/vmalloc: hugepage vmalloc mappings")
Signed-off-by: Claudio Imbrenda <[email protected]>
Reviewed-by: Uladzislau Rezki (Sony) <[email protected]>
Acked-by: Nicholas Piggin <[email protected]>
Reviewed-by: David Hildenbrand <[email protected]>
Acked-by: David Rientjes <[email protected]>
Cc: Uladzislau Rezki (Sony) <[email protected]>
Cc: Catalin Marinas <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Cc: Cornelia Huck <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Aha! Shouldn't that quick scan over pte_none()s make sure that it holds
ptlock in the PVMW_SYNC case? That too might have been responsible for
BUGs or WARNs in split_huge_page_to_list() or its unmap_page(), though
I've never seen any.
Link: https://lkml.kernel.org/r/[email protected]
Link: https://lore.kernel.org/linux-mm/[email protected]/
Fixes: ace71a19cec5 ("mm: introduce page_vma_mapped_walk()")
Signed-off-by: Hugh Dickins <[email protected]>
Acked-by: Kirill A. Shutemov <[email protected]>
Tested-by: Wang Yugui <[email protected]>
Cc: Alistair Popple <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Peter Xu <[email protected]>
Cc: Ralph Campbell <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: Yang Shi <[email protected]>
Cc: Zi Yan <[email protected]>
Cc: <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Running certain tests with a DEBUG_VM kernel would crash within hours,
on the total_mapcount BUG() in split_huge_page_to_list(), while trying
to free up some memory by punching a hole in a shmem huge page: split's
try_to_unmap() was unable to find all the mappings of the page (which,
on a !DEBUG_VM kernel, would then keep the huge page pinned in memory).
Crash dumps showed two tail pages of a shmem huge page remained mapped
by pte: ptes in a non-huge-aligned vma of a gVisor process, at the end
of a long unmapped range; and no page table had yet been allocated for
the head of the huge page to be mapped into.
Although designed to handle these odd misaligned huge-page-mapped-by-pte
cases, page_vma_mapped_walk() falls short by returning false prematurely
when !pmd_present or !pud_present or !p4d_present or !pgd_present: there
are cases when a huge page may span the boundary, with ptes present in
the next.
Restructure page_vma_mapped_walk() as a loop to continue in these cases,
while keeping its layout much as before. Add a step_forward() helper to
advance pvmw->address across those boundaries: originally I tried to use
mm's standard p?d_addr_end() macros, but hit the same crash 512 times
less often: because of the way redundant levels are folded together, but
folded differently in different configurations, it was just too
difficult to use them correctly; and step_forward() is simpler anyway.
Link: https://lkml.kernel.org/r/[email protected]
Fixes: ace71a19cec5 ("mm: introduce page_vma_mapped_walk()")
Signed-off-by: Hugh Dickins <[email protected]>
Acked-by: Kirill A. Shutemov <[email protected]>
Cc: Alistair Popple <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Peter Xu <[email protected]>
Cc: Ralph Campbell <[email protected]>
Cc: Wang Yugui <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: Yang Shi <[email protected]>
Cc: Zi Yan <[email protected]>
Cc: <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|