Age | Commit message | Author | Files | Lines
2020-12-15 | mm/zswap: move to use crypto_acomp API for hardware acceleration | Barry Song | 1 | -46/+137
Right now, all new ZIP drivers are adapted to the crypto_acomp APIs rather than the legacy crypto_comp APIs. Traditional ZIP drivers like lz4, lzo etc. have also been wrapped into acomp via the scomp backend. But zswap.c is still using the old APIs. That means zswap won't be able to work with any new ZIP drivers in the kernel. This patch moves to the crypto_acomp APIs to fix the disconnected bridge between new ZIP drivers and zswap. It is probably the first real user of acomp but perhaps not a good example to demonstrate how multiple acomp requests can be executed in parallel in one acomp instance. frontswap is doing page load and store page by page synchronously. swap_writepage() depends on the completion of frontswap_store() to decide if it should call __swap_writepage() to swap to disk. However, this patch creates multiple acomp instances, so multiple threads running on different CPUs can actually do (de)compression in parallel, leveraging the power of multiple ZIP hardware queues. This is also consistent with frontswap's page management model. The old zswap code uses atomic context and avoids race conditions while shared resources like zswap_dstmem are accessed. Here, since acomp can sleep, a per-cpu mutex is used to replace preemption-disable. While it is possible to make mm/page_io.c and mm/frontswap.c support async (de)compression in some way, the entire design requires careful thinking and performance evaluation. As a first step, the base with a fixed connection between ZIP drivers and zswap should be built. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Barry Song <[email protected]> Acked-by: Vitaly Wool <[email protected]> Cc: Luis Claudio R. Goncalves <[email protected]> Cc: Sebastian Andrzej Siewior <[email protected]> Cc: Herbert Xu <[email protected]> Cc: David S.
Miller <[email protected]> Cc: Mahipal Challa <[email protected]> Cc: Seth Jennings <[email protected]> Cc: Dan Streetman <[email protected]> Cc: Zhou Wang <[email protected]> Cc: Colin Ian King <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2020-12-15 | mm/zswap: fix passing zero to 'PTR_ERR' warning | YueHaibing | 1 | -1/+1
Fix smatch warning: mm/zswap.c:425 zswap_cpu_comp_prepare() warn: passing zero to 'PTR_ERR' crypto_alloc_comp() never return NULL, use IS_ERR instead of IS_ERR_OR_NULL to fix this. Link: https://lkml.kernel.org/r/[email protected] Fixes: f1c54846ee45 ("zswap: dynamic pool creation") Signed-off-by: YueHaibing <[email protected]> Reviewed-by: David Hildenbrand <[email protected]> Cc: Seth Jennings <[email protected]> Cc: Dan Streetman <[email protected]> Cc: Vitaly Wool <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
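The smatch warning above can be illustrated with a small userspace sketch. It assumes simplified imitations of the kernel's error-pointer helpers (modelled on include/linux/err.h, not copied from it); `sketch_check_old()` and `sketch_check_new()` are hypothetical names that do not appear in mm/zswap.c.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified userspace imitation of the kernel helpers (assumption). */
#define MAX_ERRNO 4095

static inline void *ERR_PTR(long error) { return (void *)(intptr_t)error; }
static inline long PTR_ERR(const void *ptr) { return (long)(intptr_t)ptr; }

/* Error pointers occupy the top MAX_ERRNO values of the address space. */
static inline bool IS_ERR(const void *ptr)
{
	return (uintptr_t)ptr >= (uintptr_t)-MAX_ERRNO;
}

static inline bool IS_ERR_OR_NULL(const void *ptr)
{
	return ptr == NULL || IS_ERR(ptr);
}

/* Old pattern (hypothetical): a NULL would be turned into error code 0,
 * which callers read as success. This is what smatch complains about. */
static long sketch_check_old(const void *tfm)
{
	if (IS_ERR_OR_NULL(tfm))
		return PTR_ERR(tfm);
	return 0;
}

/* Fixed pattern (hypothetical): only real error pointers reach PTR_ERR(). */
static long sketch_check_new(const void *tfm)
{
	if (IS_ERR(tfm))
		return PTR_ERR(tfm);
	return 0;
}
```

Since the allocator in question never returns NULL, the two checks behave the same in practice; the fix removes a path that could only ever report a bogus "success".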
2020-12-15 | mm/zswap: make struct kernel_param_ops definitions const | Joe Perches | 1 | -3/+3
These should be const, so make it so. Link: https://lkml.kernel.org/r/1791535ee0b00f4a5c68cc4a8adada06593ad8f1.1601770305.git.joe@perches.com Signed-off-by: Joe Perches <[email protected]> Cc: Seth Jennings <[email protected]> Cc: Dan Streetman <[email protected]> Cc: Vitaly Wool <[email protected]> Cc: "Maciej S. Szmigiero" <[email protected]> Cc: Dan Carpenter <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2020-12-15 | userfaultfd/selftests: hint the test runner on required privilege | Peter Xu | 1 | -1/+2
Now userfaultfd test program requires either root or ptrace privilege due to the signal/event tests. When UFFDIO_API failed, hint the test runner about this fact verbosely. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Peter Xu <[email protected]> Cc: Andrea Arcangeli <[email protected]> Cc: Mike Rapoport <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2020-12-15 | userfaultfd/selftests: fix retval check for userfaultfd_open() | Peter Xu | 1 | -4/+4
userfaultfd_open() returns 1 for errors rather than negatives. Fix it on all the callers so when UFFDIO_API failed the test will bail out. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Peter Xu <[email protected]> Cc: Andrea Arcangeli <[email protected]> Cc: Mike Rapoport <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
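A minimal sketch of why the return-value check matters, assuming a stand-in helper that reports failure as 1 the way the commit says userfaultfd_open() does. All names here are hypothetical, and the exact form of the check used in the selftest is not shown in this log.

```c
#include <stdbool.h>

/* Hypothetical stand-in: returns 1 on failure and 0 on success. */
static int open_stub(bool should_fail)
{
	return should_fail ? 1 : 0;
}

/* A check that only treats negative values as errors misses a return of 1. */
static bool failed_negative_check(int rc)
{
	return rc < 0;
}

/* Treating any non-zero value as an error catches it. */
static bool failed_nonzero_check(int rc)
{
	return rc != 0;
}
```

With the negative-only check, a failed open goes unnoticed and the test carries on with an unusable descriptor; the non-zero check lets it bail out.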
2020-12-15 | userfaultfd/selftests: always dump something in modes | Peter Xu | 1 | -0/+2
Patch series "userfaultfd: selftests: Small fixes". Some very trivial fixes that I kept locally to userfaultfd selftest program. This patch (of 3): BOUNCE_POLL is a special bit that if cleared it means "READ" instead. Dump that too otherwise we'll see tests with empty modes. Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Peter Xu <[email protected]> Cc: Andrea Arcangeli <[email protected]> Cc: Mike Rapoport <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2020-12-15 | userfaultfd: selftests: make __{s,u}64 format specifiers portable | Axel Rasmussen | 1 | -46/+35
On certain platforms (powerpcle is the one on which I ran into this), "%Ld" and "%Lu" are unsuitable for printing __s64 and __u64, respectively, resulting in build warnings. Cast to {u,}int64_t, and use the PRI{d,u}64 macros defined in inttypes.h to print them. This ought to be portable to all platforms. Splitting this off into a separate macro lets us remove some lines, and get rid of some (I would argue) stylistically odd cases where we joined printf() and exit() into a single statement with a ,. Finally, this also fixes a "missing braces around initializer" warning when we initialize prms in wp_range(). [[email protected]: v2] Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Axel Rasmussen <[email protected]> Acked-by: Peter Xu <[email protected]> Cc: Shuah Khan <[email protected]> Cc: Joe Perches <[email protected]> Cc: Mike Rapoport <[email protected]> Cc: Andrea Arcangeli <[email protected]> Cc: David Alan Gilbert <[email protected]> Cc: Greg Thelen <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
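The cast-plus-PRI macro approach described above can be sketched as follows. The typedefs `sketch_s64` and `sketch_u64` are placeholders standing in for __s64 and __u64, and `format_pair()` is a hypothetical helper, not a function from the selftest.

```c
#include <inttypes.h>
#include <stddef.h>
#include <stdio.h>

/* Placeholders for __s64/__u64, whose underlying type varies by platform. */
typedef long long sketch_s64;
typedef unsigned long long sketch_u64;

/* Cast to the fixed-width C99 types and print with the matching PRI macros,
 * so the format string is correct whatever the underlying type is. */
static int format_pair(char *buf, size_t len, sketch_s64 s, sketch_u64 u)
{
	return snprintf(buf, len, "%" PRId64 " %" PRIu64,
			(int64_t)s, (uint64_t)u);
}
```

The cast is what makes this portable: the PRI macros are only guaranteed to match int64_t and uint64_t, not whatever type the kernel headers picked for the platform.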
2020-12-15 | userfaultfd: add user-mode only option to unprivileged_userfaultfd sysctl knob | Lokesh Gidra | 2 | -7/+18
With this change, when the knob is set to 0, it allows unprivileged users to call userfaultfd, like when it is set to 1, but with the restriction that only page faults from user mode can be handled. In this mode, an unprivileged user (without the CAP_SYS_PTRACE capability) must pass UFFD_USER_MODE_ONLY to userfaultfd or the API will fail with EPERM. This enables administrators to reduce the likelihood that an attacker with access to userfaultfd can delay faulting kernel code to widen timing windows for other exploits. The default value of this knob is changed to 0. This is required for correct functioning of pipe mutex. However, this will fail postcopy live migration, which will be unnoticeable to the VM guests. To avoid this, set 'vm.unprivileged_userfaultfd = 1' in /etc/sysctl.conf. The main reason this change is desirable in the short term is that the Android userland will behave as with the sysctl set to zero. So without this commit, any Linux binary using userfaultfd to manage its memory would behave differently if run within the Android userland. For more details, refer to Andrea's reply [1].
[1] https://lore.kernel.org/lkml/[email protected]/ Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Lokesh Gidra <[email protected]> Reviewed-by: Andrea Arcangeli <[email protected]> Cc: Kees Cook <[email protected]> Cc: Jonathan Corbet <[email protected]> Cc: Peter Xu <[email protected]> Cc: Sebastian Andrzej Siewior <[email protected]> Cc: Alexander Viro <[email protected]> Cc: Stephen Smalley <[email protected]> Cc: Eric Biggers <[email protected]> Cc: Daniel Colascione <[email protected]> Cc: "Joel Fernandes (Google)" <[email protected]> Cc: Kalesh Singh <[email protected]> Cc: Suren Baghdasaryan <[email protected]> Cc: Jeff Vander Stoep <[email protected]> Cc: <[email protected]> Cc: Mike Rapoport <[email protected]> Cc: Shaohua Li <[email protected]> Cc: Jerome Glisse <[email protected]> Cc: Mauro Carvalho Chehab <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Mel Gorman <[email protected]> Cc: Nitin Gupta <[email protected]> Cc: Vlastimil Babka <[email protected]> Cc: Iurii Zaikin <[email protected]> Cc: Luis Chamberlain <[email protected]> Cc: Daniel Colascione <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
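The permission rule described in this commit can be sketched as a pure function. This is an illustration under stated assumptions, not the kernel's code: `SKETCH_USER_MODE_ONLY` is a placeholder for UFFD_USER_MODE_ONLY, `has_ptrace_cap` stands in for a CAP_SYS_PTRACE capability check, and `sketch_uffd_permission()` is a hypothetical name.

```c
#include <errno.h>
#include <stdbool.h>

/* Placeholder flag bit standing in for UFFD_USER_MODE_ONLY (assumption). */
#define SKETCH_USER_MODE_ONLY 0x1u

/*
 * knob:           value of the unprivileged_userfaultfd sysctl (0 or 1)
 * flags:          flags the caller passed to userfaultfd(2)
 * has_ptrace_cap: whether the caller holds CAP_SYS_PTRACE
 *
 * Returns 0 if creation is allowed, -EPERM otherwise.
 */
static int sketch_uffd_permission(int knob, unsigned int flags,
				  bool has_ptrace_cap)
{
	if (knob == 0 && !(flags & SKETCH_USER_MODE_ONLY) && !has_ptrace_cap)
		return -EPERM;
	return 0;
}
```

Under this reading, a knob value of 0 never blocks a caller that gives up kernel-fault handling, and a knob value of 1 keeps the previous permissive behaviour.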
2020-12-15 | userfaultfd: add UFFD_USER_MODE_ONLY | Lokesh Gidra | 2 | -1/+18
Patch series "Control over userfaultfd kernel-fault handling", v6. This patch series is split from [1]. The other series enables SELinux support for userfaultfd file descriptors so that its creation and movement can be controlled. It has been demonstrated on various occasions that suspending kernel code execution for an arbitrary amount of time at any access to userspace memory (copy_from_user()/copy_to_user()/...) can be exploited to change the intended behavior of the kernel. For instance, handling page faults in kernel-mode using userfaultfd has been exploited in [2, 3]. Likewise, FUSE, which is similar to userfaultfd in this respect, has been exploited in [4, 5] for similar outcome. This small patch series adds a new flag to userfaultfd(2) that allows callers to give up the ability to handle kernel-mode faults with the resulting UFFD file object. It then adds a 'user-mode only' option to the unprivileged_userfaultfd sysctl knob to require unprivileged callers to use this new flag. The purpose of this new interface is to decrease the chance of an unprivileged userfaultfd user taking advantage of userfaultfd to enhance security vulnerabilities by lengthening the race window in kernel code. [1] https://lore.kernel.org/lkml/[email protected]/ [2] https://duasynt.com/blog/linux-kernel-heap-spray [3] https://duasynt.com/blog/cve-2016-6187-heap-off-by-one-exploit [4] https://googleprojectzero.blogspot.com/2016/06/exploiting-recursion-in-linux-kernel_20.html [5] https://bugs.chromium.org/p/project-zero/issues/detail?id=808 This patch (of 2): userfaultfd handles page faults from both user and kernel code. Add a new UFFD_USER_MODE_ONLY flag for userfaultfd(2) that makes the resulting userfaultfd object refuse to handle faults from kernel mode, treating these faults as if SIGBUS were always raised, causing the kernel code to fail with EFAULT. 
A future patch adds a knob allowing administrators to give some processes the ability to create userfaultfd file objects only if they pass UFFD_USER_MODE_ONLY, reducing the likelihood that these processes will exploit userfaultfd's ability to delay kernel page faults to open timing windows for future exploits. Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Daniel Colascione <[email protected]> Signed-off-by: Lokesh Gidra <[email protected]> Reviewed-by: Andrea Arcangeli <[email protected]> Cc: Alexander Viro <[email protected]> Cc: <[email protected]> Cc: Daniel Colascione <[email protected]> Cc: Eric Biggers <[email protected]> Cc: Iurii Zaikin <[email protected]> Cc: Jeff Vander Stoep <[email protected]> Cc: Jerome Glisse <[email protected]> Cc: "Joel Fernandes (Google)" <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Jonathan Corbet <[email protected]> Cc: Kalesh Singh <[email protected]> Cc: Kees Cook <[email protected]> Cc: Luis Chamberlain <[email protected]> Cc: Mauro Carvalho Chehab <[email protected]> Cc: Mel Gorman <[email protected]> Cc: Mike Rapoport <[email protected]> Cc: Nitin Gupta <[email protected]> Cc: Peter Xu <[email protected]> Cc: Sebastian Andrzej Siewior <[email protected]> Cc: Shaohua Li <[email protected]> Cc: Stephen Smalley <[email protected]> Cc: Suren Baghdasaryan <[email protected]> Cc: Vlastimil Babka <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2020-12-15 | mm, page_poison: remove CONFIG_PAGE_POISONING_ZERO | Vlastimil Babka | 4 | -28/+2
CONFIG_PAGE_POISONING_ZERO uses the zero pattern instead of 0xAA. It was introduced by commit 1414c7f4f7d7 ("mm/page_poisoning.c: allow for zero poisoning"), noting that using zeroes retains the benefit of sanitizing content of freed pages, with the benefit of not having to zero them again on alloc, and the downside of making some forms of corruption (stray writes of NULLs) harder to detect than with the 0xAA pattern. Together with CONFIG_PAGE_POISONING_NO_SANITY it made possible to sanitize the contents on free without checking it back on alloc. These days we have the init_on_free() option to achieve sanitization with zeroes and to save clearing on alloc (and without checking on alloc). Arguably if someone does choose to check the poison for corruption on alloc, the savings of not clearing the page are secondary, and it makes sense to always use the 0xAA poison pattern. Thus, remove the CONFIG_PAGE_POISONING_ZERO option for being redundant. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Vlastimil Babka <[email protected]> Acked-by: David Hildenbrand <[email protected]> Cc: Mike Rapoport <[email protected]> Cc: Rafael J. Wysocki <[email protected]> Cc: Alexander Potapenko <[email protected]> Cc: Kees Cook <[email protected]> Cc: Laura Abbott <[email protected]> Cc: Mateusz Nosek <[email protected]> Cc: Michal Hocko <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2020-12-15 | mm, page_poison: remove CONFIG_PAGE_POISONING_NO_SANITY | Vlastimil Babka | 3 | -17/+5
CONFIG_PAGE_POISONING_NO_SANITY skips the check on page alloc whether the poison pattern was corrupted, suggesting a use-after-free. The motivation to introduce it in commit 8823b1dbc05f ("mm/page_poison.c: enable PAGE_POISONING as a separate option") was to simply sanitize freed pages, optimally together with CONFIG_PAGE_POISONING_ZERO. These days we have an init_on_free=1 boot option, which makes this use case of page poisoning redundant. For sanitizing, writing zeroes is sufficient, there is pretty much no benefit from writing the 0xAA poison pattern to freed pages, without checking it back on alloc. Thus, remove this option and suggest init_on_free instead in the main config's help. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Vlastimil Babka <[email protected]> Acked-by: David Hildenbrand <[email protected]> Cc: Mike Rapoport <[email protected]> Cc: Rafael J. Wysocki <[email protected]> Cc: Alexander Potapenko <[email protected]> Cc: Kees Cook <[email protected]> Cc: Laura Abbott <[email protected]> Cc: Mateusz Nosek <[email protected]> Cc: Michal Hocko <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
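The poison-on-free, check-on-alloc idea behind the two page_poison commits above can be sketched in userspace. The 0xAA pattern comes from the commit text; the page size and the function names are assumptions made for the illustration.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define SKETCH_PAGE_POISON 0xAA   /* pattern named in the commit messages */
#define SKETCH_PAGE_SIZE   4096   /* assumed page size for the sketch */

/* On "free": fill the page with the poison pattern. */
static void sketch_poison_page(unsigned char *page)
{
	memset(page, SKETCH_PAGE_POISON, SKETCH_PAGE_SIZE);
}

/* On "alloc": any byte that no longer matches suggests a write after free. */
static bool sketch_poison_intact(const unsigned char *page)
{
	for (size_t i = 0; i < SKETCH_PAGE_SIZE; i++)
		if (page[i] != SKETCH_PAGE_POISON)
			return false;
	return true;
}
```

A stray write of zero is caught here because 0x00 differs from 0xAA. With an all-zero poison pattern the same write would go unnoticed, which is the detection trade-off the CONFIG_PAGE_POISONING_ZERO removal mentions.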
2020-12-15 | kernel/power: allow hibernation with page_poison sanity checking | Vlastimil Babka | 5 | -6/+14
Page poisoning used to be incompatible with hibernation, as the state of poisoned pages was lost after resume, thus enabling CONFIG_HIBERNATION forces CONFIG_PAGE_POISONING_NO_SANITY. For the same reason, the poisoning with zeroes variant CONFIG_PAGE_POISONING_ZERO used to disable hibernation. The latter restriction was removed by commit 1ad1410f632d ("PM / Hibernate: allow hibernation with PAGE_POISONING_ZERO") and similarly for init_on_free by commit 18451f9f9e58 ("PM: hibernate: fix crashes with init_on_free=1") by making sure free pages are cleared after resume. We can use the same mechanism to instead poison free pages with PAGE_POISON after resume. This covers both zero and 0xAA patterns. Thus we can remove the Kconfig restriction that disables page poison sanity checking when hibernation is enabled. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Vlastimil Babka <[email protected]> Acked-by: Rafael J. Wysocki <[email protected]> [hibernation] Reviewed-by: David Hildenbrand <[email protected]> Cc: Mike Rapoport <[email protected]> Cc: Alexander Potapenko <[email protected]> Cc: Kees Cook <[email protected]> Cc: Laura Abbott <[email protected]> Cc: Mateusz Nosek <[email protected]> Cc: Michal Hocko <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2020-12-15 | mm, page_poison: use static key more efficiently | Vlastimil Babka | 4 | -54/+52
Commit 11c9c7edae06 ("mm/page_poison.c: replace bool variable with static key") changed page_poisoning_enabled() to a static key check. However, the function is not inlined, so each check still involves a function call with overhead not eliminated when page poisoning is disabled. Analogically to how debug_pagealloc is handled, this patch converts page_poisoning_enabled() back to boolean check, and introduces page_poisoning_enabled_static() for fast paths. Both functions are inlined. The function kernel_poison_pages() is also called unconditionally and does the static key check inside. Remove it from there and put it to callers. Also split it to two functions kernel_poison_pages() and kernel_unpoison_pages() instead of the confusing bool parameter. Also optimize the check that enables page poisoning instead of debug_pagealloc for architectures without proper debug_pagealloc support. Move the check to init_mem_debugging_and_hardening() to enable a single static key instead of having two static branches in page_poisoning_enabled_static(). Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Vlastimil Babka <[email protected]> Reviewed-by: David Hildenbrand <[email protected]> Cc: Mike Rapoport <[email protected]> Cc: Rafael J. Wysocki <[email protected]> Cc: Alexander Potapenko <[email protected]> Cc: Kees Cook <[email protected]> Cc: Laura Abbott <[email protected]> Cc: Mateusz Nosek <[email protected]> Cc: Michal Hocko <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2020-12-15 | mm, page_alloc: do not rely on the order of page_poison and init_on_alloc/free parameters | Vlastimil Babka | 3 | -64/+46
Patch series "cleanup page poisoning", v3. I have identified a number of issues and opportunities for cleanup with CONFIG_PAGE_POISON and friends: - interaction with init_on_alloc and init_on_free parameters depends on the order of parameters (Patch 1) - the boot time enabling uses static key, but inefficiently (Patch 2) - sanity checking is incompatible with hibernation (Patch 3) - CONFIG_PAGE_POISONING_NO_SANITY can be removed now that we have init_on_free (Patch 4) - CONFIG_PAGE_POISONING_ZERO can be most likely removed now that we have init_on_free (Patch 5) This patch (of 5): Enabling page_poison=1 together with init_on_alloc=1 or init_on_free=1 produces a warning in dmesg that page_poison takes precedence. However, as these warnings are printed in early_param handlers for init_on_alloc/free, they are not printed if page_poison is enabled later on the command line (handlers are called in the order of their parameters), or when init_on_alloc/free is always enabled by the respective config option - before the page_poison early param handler is called, it is not considered to be enabled. This is inconsistent. We can remove the dependency on order by making the init_on_* parameters only set a boolean variable, and postponing the evaluation until after all early params have been processed. Introduce a new init_mem_debugging_and_hardening() function for that, and move the related debug_pagealloc processing there as well. As a result init_mem_debugging_and_hardening() always knows accurately if init_on_* and/or page_poison options were enabled. Thus we can also optimize want_init_on_alloc() and want_init_on_free(). We don't need to check page_poisoning_enabled() there, we can instead not enable the init_on_* static keys at all, if page poisoning is enabled. This results in simpler and more effective code.
Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Vlastimil Babka <[email protected]> Reviewed-by: David Hildenbrand <[email protected]> Reviewed-by: Mike Rapoport <[email protected]> Cc: Rafael J. Wysocki <[email protected]> Cc: Alexander Potapenko <[email protected]> Cc: Kees Cook <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Mateusz Nosek <[email protected]> Cc: Laura Abbott <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
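The "record first, evaluate later" approach described above can be sketched as follows. The struct and function names are hypothetical. The sketch assumes only what the commit message states: parameter handlers set booleans, and a later step decides which features are enabled, with page poisoning taking precedence over init_on_*.

```c
#include <stdbool.h>

/* What the early parameter handlers recorded, in whatever order. */
struct sketch_mem_opts {
	bool page_poison;
	bool init_on_alloc;
	bool init_on_free;
};

/* What is actually enabled once all parameters have been seen. */
struct sketch_mem_state {
	bool poison_enabled;
	bool init_on_alloc_enabled;
	bool init_on_free_enabled;
};

/* Single evaluation point: the result cannot depend on parameter order,
 * because it only looks at the final recorded values. */
static struct sketch_mem_state sketch_resolve(struct sketch_mem_opts o)
{
	struct sketch_mem_state s = { .poison_enabled = o.page_poison };

	/* Page poisoning takes precedence: leave init_on_* disabled. */
	s.init_on_alloc_enabled = o.init_on_alloc && !o.page_poison;
	s.init_on_free_enabled = o.init_on_free && !o.page_poison;
	return s;
}
```

Because the init_on_* flags are never enabled while poisoning is on, the fast-path checks no longer need to consult the poisoning state at all.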
2020-12-15 | mm: cma: improve pr_debug log in cma_release() | Charan Teja Reddy | 1 | -1/+1
It is required to print 'count' of pages, along with the pages, passed to cma_release to debug the cases of mismatched count value passed between cma_alloc() and cma_release() from a code path. As an example, consider the below scenario: 1) CMA pool size is 4MB and 2) User doing the erroneous step of allocating 2 pages but freeing 1 page in a loop from this CMA pool. The step 2 causes cma_alloc() to return NULL at one point of time because of -ENOMEM condition. And the current pr_debug logs is not giving the info about these types of allocation patterns because of count value not being printed in cma_release(). We are printing the count value in the trace logs, just extend the same to pr_debug logs too. [[email protected]: fix printk warning] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Charan Teja Reddy <[email protected]> Reviewed-by: Souptick Joarder <[email protected]> Reviewed-by: David Hildenbrand <[email protected]> Acked-by: Vlastimil Babka <[email protected]> Cc: Vinayak Menon <[email protected]> Cc: Joonsoo Kim <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2020-12-15 | mm/cma.c: remove redundant cma_mutex lock | Lecopzer Chen | 1 | -3/+1
The cma_mutex which protects alloc_contig_range() first appeared in commit 7ee793a62fa8c ("cma: Remove potential deadlock situation"). At that time, there was no guarantee about the behavior of concurrency inside alloc_contig_range(). After commit 2c7452a075d4db2dc ("mm/page_isolation.c: make start_isolate_page_range() fail if already isolated") > However, two subsystems (CMA and gigantic > huge pages for example) could attempt operations on the same range. If > this happens, one thread may 'undo' the work another thread is doing. > This can result in pageblocks being incorrectly left marked as > MIGRATE_ISOLATE and therefore not available for page allocation. The concurrency inside alloc_contig_range() was clarified. Now we can find that hugepage and virtio call alloc_contig_range() without any lock, thus cma_mutex is "redundant" in cma_alloc() now. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Lecopzer Chen <[email protected]> Acked-by: David Hildenbrand <[email protected]> Acked-by: Vlastimil Babka <[email protected]> Cc: Matthias Brugger <[email protected]> Cc: YJ Chiang <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2020-12-15 | mm: migrate: remove unused parameter in migrate_vma_insert_page() | Stephen Zhang | 1 | -4/+2
"dst" parameter to migrate_vma_insert_page() is not used anymore. Link: https://lkml.kernel.org/r/CANubcdUwCAMuUyamG2dkWP=cqSR9MAS=tHLDc95kQkqU-rEnAg@mail.gmail.com Signed-off-by: Stephen Zhang <[email protected]> Reviewed-by: David Hildenbrand <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2020-12-15 | mm: migrate: return -ENOSYS if THP migration is unsupported | Yang Shi | 1 | -16/+46
In the current implementation unmap_and_move() would return -ENOMEM if THP migration is unsupported, then the THP will be split. If the split fails, it just exits without trying to migrate other pages. This doesn't make much sense since there may be enough free memory to migrate other pages and there may be a lot of base pages on the list. Return -ENOSYS to be consistent with hugetlb. And if the THP split fails, just skip it and try other pages on the list. Just skip the whole list and exit when free memory is really low. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Yang Shi <[email protected]> Cc: Jan Kara <[email protected]> Cc: Matthew Wilcox <[email protected]> Cc: Mel Gorman <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Song Liu <[email protected]> Cc: Zi Yan <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2020-12-15 | mm: migrate: clean up migrate_prep{_local} | Yang Shi | 3 | -14/+6
The migrate_prep{_local} never fails, so it is pointless to have return value and check the return value. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Yang Shi <[email protected]> Reviewed-by: Zi Yan <[email protected]> Cc: Jan Kara <[email protected]> Cc: Matthew Wilcox <[email protected]> Cc: Mel Gorman <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Song Liu <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2020-12-15 | mm: migrate: skip shared exec THP for NUMA balancing | Yang Shi | 1 | -2/+16
NUMA balancing skips shared exec base pages. Since CONFIG_READ_ONLY_THP_FOR_FS was introduced, there are probably shared exec THPs, so skip such THPs for NUMA balancing as well. And Willy's regular filesystem THP support patches could create shared exec THPs even without that config. In addition, page_is_file_lru() is used to tell if the page is file cache or not, but it filters out shmem pages. Putting executables in shmem to gain performance via shmem-THP sounds like a typical use case, so it sounds worth skipping migration for such a case too. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Yang Shi <[email protected]> Cc: Matthew Wilcox <[email protected]> Cc: Jan Kara <[email protected]> Cc: Mel Gorman <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Song Liu <[email protected]> Cc: Zi Yan <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2020-12-15 | mm: migrate: simplify the logic for handling permanent failure | Yang Shi | 1 | -30/+38
When unmap_and_move{_huge_page}() returns !-EAGAIN and !MIGRATEPAGE_SUCCESS, the page would be put back to LRU or proper list if it is non-LRU movable page. But, the callers always call putback_movable_pages() to put the failed pages back later on, so it seems not very efficient to put every single page back immediately, and the code looks convoluted. Put the failed page on a separate list, then splice the list to migrate list when all pages are tried. It is the caller's responsibility to call putback_movable_pages() to handle failures. This also makes the code simpler and more readable. After the change the rules are: * Success: non hugetlb page will be freed, hugetlb page will be put back * -EAGAIN: stay on the from list * -ENOMEM: stay on the from list * Other errno: put on ret_pages list then splice to from list The from list would be empty iff all pages are migrated successfully, it was not so before. This has no impact to current existing callsites. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Yang Shi <[email protected]> Reviewed-by: Zi Yan <[email protected]> Cc: Jan Kara <[email protected]> Cc: Matthew Wilcox <[email protected]> Cc: Mel Gorman <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Song Liu <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
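The post-change rules listed in this commit can be sketched as a small classifier. The enum and function names are hypothetical, and treating success as return code 0 is an assumption of the sketch (it matches MIGRATEPAGE_SUCCESS as far as I know, but the log does not state the value).

```c
#include <errno.h>

/* Where a page ends up after one migration attempt (hypothetical names). */
enum sketch_dest {
	SKETCH_DONE,          /* migrated: freed, or put back if hugetlb */
	SKETCH_STAY_ON_FROM,  /* left on the caller's "from" list */
	SKETCH_RET_PAGES      /* parked on ret_pages, spliced to "from" later */
};

/* rc is the per-page return code; 0 is assumed to mean success. */
static enum sketch_dest sketch_classify(int rc)
{
	switch (rc) {
	case 0:
		return SKETCH_DONE;
	case -EAGAIN:
	case -ENOMEM:
		return SKETCH_STAY_ON_FROM;
	default:
		return SKETCH_RET_PAGES;
	}
}
```

Every failed page ends up on the from list once the splice is done, which is why the from list is empty only when all pages migrated successfully and why the caller alone is responsible for calling putback_movable_pages().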
2020-12-15 | mm: truncate_complete_page() does not exist any more | Yang Shi | 2 | -2/+2
Patch series "mm: misc migrate cleanup and improvement", v3. This patch (of 5): Commit 9f4e41f4717832e ("mm: refactor truncate_complete_page()") refactored truncate_complete_page(), which no longer exists. Correct the comments in vmscan and migrate to avoid confusion. Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Yang Shi <[email protected]> Reviewed-by: Jan Kara <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Zi Yan <[email protected]> Cc: Song Liu <[email protected]> Cc: Mel Gorman <[email protected]> Cc: Matthew Wilcox <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2020-12-15 | mm: support THPs in zero_user_segments | Matthew Wilcox (Oracle) | 2 | -4/+67
We can only kmap() one subpage of a THP at a time, so loop over all relevant subpages, skipping ones which don't need to be zeroed. This is too large to inline when THPs are enabled and we actually need highmem, so put it in highmem.c. [[email protected]: start1 was allowed to be less than start2] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Matthew Wilcox (Oracle) <[email protected]> Cc: Yang Shi <[email protected]> Cc: Jan Kara <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Zi Yan <[email protected]> Cc: Song Liu <[email protected]> Cc: Mel Gorman <[email protected]> Cc: Naresh Kamboju <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
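The per-subpage loop described above can be sketched in userspace. This is not the code added to highmem.c: subpages are modelled as separately addressed buffers (mimicking the fact that only one subpage is mapped at a time), the subpage size is shrunk to 16 bytes for readability, and all names are hypothetical.

```c
#include <stddef.h>
#include <string.h>

#define SKETCH_SUBPAGE_SIZE 16u   /* tiny stand-in for PAGE_SIZE */

/* Zero the part of [start, end) that falls inside the subpage whose first
 * byte sits at offset `base` of the compound page; do nothing if the
 * segment does not touch this subpage. */
static void zero_in_subpage(unsigned char *sub, size_t base,
			    size_t start, size_t end)
{
	size_t lo = start > base ? start : base;
	size_t hi = end < base + SKETCH_SUBPAGE_SIZE
			? end : base + SKETCH_SUBPAGE_SIZE;

	if (lo < hi)
		memset(sub + (lo - base), 0, hi - lo);
}

/* Zero two byte ranges of a compound page, one subpage at a time. */
static void sketch_zero_segments(unsigned char *subpages[], size_t nr,
				 size_t start1, size_t end1,
				 size_t start2, size_t end2)
{
	for (size_t i = 0; i < nr; i++) {
		size_t base = i * SKETCH_SUBPAGE_SIZE;

		zero_in_subpage(subpages[i], base, start1, end1);
		zero_in_subpage(subpages[i], base, start2, end2);
	}
}
```

The sketch skips only the memset for untouched subpages; per the commit text, the kernel version also skips the subpage entirely, so it is never mapped.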
2020-12-15 | mm/migrate.c: optimize migrate_vma_pages() mmu notifier | Ralph Campbell | 1 | -5/+4
When migrating a zero page or pte_none() anonymous page to device private memory, migrate_vma_setup() will initialize the src[] array with a NULL PFN. This lets the device driver allocate device private memory and clear it instead of DMAing a page of zeros over the device bus. Since the source page didn't exist at the time, no struct page was locked nor a migration PTE inserted into the CPU page tables. The actual PTE insertion happens in migrate_vma_pages() when it tries to insert the device private struct page PTE into the CPU page tables. migrate_vma_pages() has to call the mmu notifiers again since another device could fault on the same page before the page table locks are acquired. Allow device drivers to optimize the invalidation similar to migrate_vma_setup() by calling mmu_notifier_range_init() which sets struct mmu_notifier_range event type to MMU_NOTIFY_MIGRATE and the migrate_pgmap_owner field. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Ralph Campbell <[email protected]> Cc: Jerome Glisse <[email protected]> Cc: John Hubbard <[email protected]> Cc: Alistair Popple <[email protected]> Cc: Christoph Hellwig <[email protected]> Cc: Jason Gunthorpe <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2020-12-15 | mm/migrate.c: fix comment spelling | Long Li | 1 | -1/+1
The word in the comment is misspelled, it should be "include". Link: https://lkml.kernel.org/r/20201024114144.GA20552@lilong Signed-off-by: Long Li <[email protected]> Reviewed-by: Andrew Morton <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2020-12-15 | mm/oom_kill: change comment and rename is_dump_unreclaim_slabs() | Hui Su | 1 | -6/+8
Change the comment of is_dump_unreclaim_slabs(): it just checks whether the amount of nr_unreclaimable slabs is greater than user memory, and explain why we dump unreclaimable slabs. Renaming it to should_dump_unreclaim_slab() may be better. Link: https://lkml.kernel.org/r/20201030182704.GA53949@rlk Signed-off-by: Hui Su <[email protected]> Acked-by: Michal Hocko <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2020-12-15mm/compaction: make defer_compaction and compaction_deferred staticHui Su2-16/+4
defer_compaction() and compaction_deferred() and compaction_restarting() in mm/compaction.c won't be used in other files, so make them static, and remove the declaration in the header file. Take the chance to fix a typo. Link: https://lkml.kernel.org/r/20201123170801.GA9625@rlk Signed-off-by: Hui Su <[email protected]> Acked-by: Vlastimil Babka <[email protected]> Cc: Nitin Gupta <[email protected]> Cc: Baoquan He <[email protected]> Cc: Mateusz Nosek <[email protected]> Cc: Joonsoo Kim <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2020-12-15mm/compaction: move compaction_suitable's comment to right placeHui Su1-7/+7
Since commit 837d026d560c ("mm/compaction: more trace to understand when/why compaction start/finish"), the comment place is not suitable. So move compaction_suitable's comment to right place. Link: https://lkml.kernel.org/r/20201116144121.GA385717@rlk Signed-off-by: Hui Su <[email protected]> Reviewed-by: Andrew Morton <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2020-12-15mm/compaction: rename 'start_pfn' to 'iteration_start_pfn' in compact_zone()Yanfei Xu1-4/+3
There are two 'start_pfn' variables declared in compact_zone() which have different meanings. Rename the second one to 'iteration_start_pfn' to prevent confusion. Also, remove a useless semicolon. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Yanfei Xu <[email protected]> Acked-by: David Hildenbrand <[email protected]> Acked-by: Vlastimil Babka <[email protected]> Acked-by: Pankaj Gupta <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2020-12-15z3fold: remove preempt disabled sections for RTVitaly Wool1-7/+10
Replace get_cpu_ptr() with migrate_disable()+this_cpu_ptr() so RT can take spinlocks that become sleeping locks. Signed-off-by: Mike Galbraith <[email protected]> Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Vitaly Wool <[email protected]> Cc: Sebastian Andrzej Siewior <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2020-12-15z3fold: stricter locking and more careful reclaimVitaly Wool1-58/+85
Use temporary slots in reclaim function to avoid possible race when freeing those. While at it, make sure we check CLAIMED flag under page lock in the reclaim function to make sure we are not racing with z3fold_alloc(). Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Vitaly Wool <[email protected]> Cc: <[email protected]> Cc: Mike Galbraith <[email protected]> Cc: Sebastian Andrzej Siewior <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2020-12-15z3fold: simplify freeing slotsVitaly Wool1-42/+13
Patch series "z3fold: stability / rt fixes". Address z3fold stability issues under stress load, primarily in the reclaim and free aspects. Besides, it fixes the locking problems that were only seen in real-time kernel configuration. This patch (of 3): There used to be two places in the code where slots could be freed, namely when freeing the last allocated handle from the slots and when releasing the z3fold header these slots are linked to. The logic to decide on whether to free certain slots was complicated and error-prone in both functions and it led to failures in the RT case. To fix that, make free_handle() the single point of freeing slots. Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Vitaly Wool <[email protected]> Tested-by: Mike Galbraith <[email protected]> Cc: Sebastian Andrzej Siewior <[email protected]> Cc: <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2020-12-15mm/page_isolation: do not isolate the max order pageMuchun Song1-1/+1
A max order page has no buddy page and never merges to another order. So isolating and then freeing it is pointless. Link: https://lkml.kernel.org/r/[email protected] Fixes: 3c605096d315 ("mm/page_alloc: restrict max order of merging on isolated pageblock") Signed-off-by: Muchun Song <[email protected]> Reviewed-by: Andrew Morton <[email protected]> Acked-by: Vlastimil Babka <[email protected]> Reviewed-by: David Hildenbrand <[email protected]> Reviewed-by: Oscar Salvador <[email protected]> Cc: Joonsoo Kim <[email protected]> Cc: Matthew Wilcox <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
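The reasoning in the message above can be illustrated with a small userspace sketch. It is not the kernel code: SKETCH_MAX_ORDER and both helper names are hypothetical, and the value 11 is only an assumed default. The XOR shows how a block's buddy is located in a buddy allocator, and the second helper shows why a block at the top order has nothing to merge with.

```c
#include <assert.h>

/* Hypothetical constant for this sketch; the kernel's value is configurable. */
#define SKETCH_MAX_ORDER 11

/* PFN of the buddy of the block that starts at pfn and has the given order. */
static unsigned long sketch_buddy_pfn(unsigned long pfn, unsigned int order)
{
	return pfn ^ (1UL << order);
}

/* A free block can merge upward only while its order is below the top order. */
static int sketch_can_merge(unsigned int order)
{
	return order < SKETCH_MAX_ORDER - 1;
}
```

Under these assumptions, a block of order SKETCH_MAX_ORDER - 1 never merges, so isolating it only to free it again would gain nothing.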
2020-12-15mm/vmscan.c: remove the filename in the top of file commentlogic.yu1-2/+0
No point in having the filename inside the file. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: logic.yu <[email protected]> Reviewed-by: Andrew Morton <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2020-12-15mm/vmscan: drop unneeded assignment in kswapd()Lukas Bulwahn1-1/+1
The refactoring to kswapd() in commit e716f2eb24de ("mm, vmscan: prevent kswapd sleeping prematurely due to mismatched classzone_idx") turned an assignment to reclaim_order into a dead store, as in all further paths, reclaim_order will be assigned again before it is used. make clang-analyzer on x86_64 tinyconfig caught my attention with: mm/vmscan.c: warning: Although the value stored to 'reclaim_order' is used in the enclosing expression, the value is never actually read from 'reclaim_order' [clang-analyzer-deadcode.DeadStores] Compilers will detect this unneeded assignment and optimize this anyway. So, the resulting binary is identical before and after this change. Simplify the code and remove unneeded assignment to make clang-analyzer happy. No functional change. No change in binary code. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Lukas Bulwahn <[email protected]> Acked-by: Mel Gorman <[email protected]> Cc: Vlastimil Babka <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Nathan Chancellor <[email protected]> Cc: Nick Desaulniers <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2020-12-15mm: don't wake kswapd prematurely when watermark boosting is disabledJohannes Weiner1-6/+7
On 2-node NUMA hosts we see bursts of kswapd reclaim and subsequent pressure spikes and stalls from cache refaults while there is plenty of free memory in the system. Usually, kswapd is woken up when all eligible nodes in an allocation are full. But the code related to watermark boosting can wake kswapd on one full node while the other one is mostly empty. This may be justified to fight fragmentation, but is currently unconditionally done whether watermark boosting is occurring or not. In our case, many of our workloads' throughput scales with available memory, and pure utilization is a more tangible concern than trends around longer-term fragmentation. As a result we generally disable watermark boosting. Wake kswapd only when watermark boosting is requested. Link: https://lkml.kernel.org/r/[email protected] Fixes: 1c30844d2dfe ("mm: reclaim small amounts of memory when an external fragmentation event occurs") Signed-off-by: Johannes Weiner <[email protected]> Acked-by: Mel Gorman <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2020-12-15hugetlb: fix an error code in hugetlb_reserve_pages()Dan Carpenter1-0/+1
Preserve the error code from region_add() instead of returning success. Link: https://lkml.kernel.org/r/X9NGZWnZl5/Mt99R@mwanda Fixes: 0db9d74ed884 ("hugetlb: disable region_add file_region coalescing") Signed-off-by: Dan Carpenter <[email protected]> Reviewed-by: Mike Kravetz <[email protected]> Reviewed-by: David Hildenbrand <[email protected]> Cc: Mina Almasry <[email protected]> Cc: David Rientjes <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
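The class of bug described above can be sketched in plain C. This is an illustration of the pattern only, not the code of hugetlb_reserve_pages(): every name below is hypothetical, and sketch_region_add() merely stands in for a helper that returns a count on success or a negative errno on failure.

```c
#include <assert.h>

/* Hypothetical stand-in: returns a page count, or a negative errno. */
static long sketch_region_add(long result)
{
	return result;
}

/* Buggy shape: the error path is taken, but ret still holds 0. */
static long sketch_reserve_buggy(long add_result)
{
	long ret = 0;
	long add = sketch_region_add(add_result);

	if (add < 0)
		goto out;	/* failure is reported to the caller as success */
out:
	return ret;
}

/* Fixed shape: copy the error code into ret before leaving. */
static long sketch_reserve_fixed(long add_result)
{
	long ret = 0;
	long add = sketch_region_add(add_result);

	if (add < 0) {
		ret = add;	/* preserve the error code */
		goto out;
	}
out:
	return ret;
}
```

With an input of -12 (the value of ENOMEM on Linux), the buggy variant returns 0 while the fixed one returns -12.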
2020-12-15mm,hugetlb: remove unneeded initializationOscar Salvador1-2/+0
hugetlb_add_hstate initializes nr_huge_pages and free_huge_pages to 0, but since hstates[] is a global variable, all its fields are defined to 0 already. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Oscar Salvador <[email protected]> Reviewed-by: Mike Kravetz <[email protected]> Reviewed-by: David Hildenbrand <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2020-12-15mm: hugetlb: fix type of delta parameter and related local variables in gather_surplus_pages()Liu Xiang1-3/+4
On a 64-bit machine, the delta variable in hugetlb_acct_memory() may be larger than 0xffffffff, but gather_surplus_pages() can only use the low 32-bit value now. So we need to fix the type of the delta parameter and related local variables in gather_surplus_pages(). Link: https://lkml.kernel.org/r/[email protected] Reported-by: Ma Chenggong <[email protected]> Signed-off-by: Liu Xiang <[email protected]> Signed-off-by: Pan Jiagen <[email protected]> Reviewed-by: Mike Kravetz <[email protected]> Cc: Liu Xiang <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
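The truncation described above can be reproduced in userspace. The sketch below is an assumption-laden illustration, not the kernel function: both helper names are hypothetical, and uint32_t is used in place of the kernel's narrow type so that the conversion is well defined in standard C.

```c
#include <assert.h>
#include <stdint.h>

/* Narrow variant: only the low 32 bits of delta survive the conversion. */
static int64_t sketch_delta_narrow(int64_t delta)
{
	uint32_t low = (uint32_t)delta;	/* the upper 32 bits are dropped here */

	return (int64_t)low;
}

/* Wide variant: the full 64-bit value is kept. */
static int64_t sketch_delta_wide(int64_t delta)
{
	return delta;
}
```

A delta of 0x100000001 comes back as 1 from the narrow variant, which is the kind of silent undercount that a wider parameter type avoids.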
2020-12-15khugepaged: add parameter explanations for kernel-doc markupAlex Shi1-1/+13
Add missed parameter explanation for some kernel-doc warnings: mm/khugepaged.c:102: warning: Function parameter or member 'nr_pte_mapped_thp' not described in 'mm_slot' mm/khugepaged.c:102: warning: Function parameter or member 'pte_mapped_thp' not described in 'mm_slot' mm/khugepaged.c:1424: warning: Function parameter or member 'mm' not described in 'collapse_pte_mapped_thp' mm/khugepaged.c:1424: warning: Function parameter or member 'addr' not described in 'collapse_pte_mapped_thp' mm/khugepaged.c:1626: warning: Function parameter or member 'mm' not described in 'collapse_file' mm/khugepaged.c:1626: warning: Function parameter or member 'file' not described in 'collapse_file' mm/khugepaged.c:1626: warning: Function parameter or member 'start' not described in 'collapse_file' mm/khugepaged.c:1626: warning: Function parameter or member 'hpage' not described in 'collapse_file' mm/khugepaged.c:1626: warning: Function parameter or member 'node' not described in 'collapse_file' Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Alex Shi <[email protected]> Cc: Randy Dunlap <[email protected]> Cc: Jonathan Corbet <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2020-12-15include/linux/huge_mm.h: remove extern keywordRalph Campbell1-52/+41
The external function definitions don't need the "extern" keyword. Remove them so future changes don't copy the function definition style. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Ralph Campbell <[email protected]> Reviewed-by: Christoph Hellwig <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2020-12-15mm/hugetlb.c: just use put_page_testzero() instead of page_count()Hui Su1-2/+1
We test whether the page reference count is zero here; it would be a bug if the page reference count were not zero. So we can just use put_page_testzero() instead of page_count(). Link: https://lkml.kernel.org/r/20201007170949.GA6416@rlk Signed-off-by: Hui Su <[email protected]> Cc: Mike Kravetz <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2020-12-15mm,hwpoison: return -EBUSY when migration failsOscar Salvador1-3/+3
Currently, we return -EIO when we fail to migrate the page. Migration failures are rather transient as they can happen for several reasons, e.g. a high page refcount bump, mapping->migrate_page failing, etc. All of these mean that the page could not be migrated at that time, but that has nothing to do with an EIO error. Let us return -EBUSY instead, as we do in case we failed to isolate the page. While at it, let us remove the "ret" print as its value does not change. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Oscar Salvador <[email protected]> Acked-by: Naoya Horiguchi <[email protected]> Acked-by: Vlastimil Babka <[email protected]> Cc: David Hildenbrand <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2020-12-15mm,memory_failure: always pin the page in madvise_inject_errorOscar Salvador2-8/+7
madvise_inject_error() uses get_user_pages_fast to translate the address we specified to a page. After [1], we drop the extra reference count for memory_failure() path. That commit says that memory_failure wanted to keep the pin in order to take the page out of circulation. The truth is that we need to keep the page pinned, otherwise the page might be re-used after the put_page() and we can end up messing with someone else's memory. E.g: CPU0 process X CPU1 madvise_inject_error get_user_pages put_page page gets reclaimed process Y allocates the page memory_failure // We mess with process Y memory madvise() is meant to operate on a self address space, so messing with pages that do not belong to us seems the wrong thing to do. To avoid that, let us keep the page pinned for memory_failure as well. Pages for DAX mappings will release this extra refcount in memory_failure_dev_pagemap. [1] ("23e7b5c2e271: mm, madvise_inject_error: Let memory_failure() optionally take a page reference") Link: https://lkml.kernel.org/r/[email protected] Fixes: 23e7b5c2e271 ("mm, madvise_inject_error: Let memory_failure() optionally take a page reference") Signed-off-by: Oscar Salvador <[email protected]> Suggested-by: Vlastimil Babka <[email protected]> Acked-by: Naoya Horiguchi <[email protected]> Cc: Vlastimil Babka <[email protected]> Cc: Dan Williams <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2020-12-15mm,hwpoison: remove drain_all_pages from shake_pageOscar Salvador1-5/+2
get_hwpoison_page already drains pcplists, previously disabling them when trying to grab a refcount. We do not need shake_page to take care of it anymore. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Oscar Salvador <[email protected]> Acked-by: Naoya Horiguchi <[email protected]> Cc: Qian Cai <[email protected]> Cc: Vlastimil Babka <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2020-12-15mm,hwpoison: disable pcplists before grabbing a refcountOscar Salvador1-67/+65
Currently, we have a sort of retry mechanism to make sure pages in pcp-lists are spilled to the buddy system, so we can handle those. We can save ourselves these extra checks with the new disable-pcplist mechanism that is available with [1]. zone_pcplist_disable makes sure to 1) disable pcplists, so any page that is freed up from that point onwards will end up in the buddy system and 2) drain pcplists, so those pages that are already in pcplists are spilled to buddy. With that, we can make a common entry point for grabbing a refcount from both soft_offline and memory_failure paths that is guarded by zone_pcplist_disable/zone_pcplist_enable. [1] https://patchwork.kernel.org/project/linux-mm/cover/[email protected]/ Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Oscar Salvador <[email protected]> Acked-by: Naoya Horiguchi <[email protected]> Acked-by: Vlastimil Babka <[email protected]> Cc: Qian Cai <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2020-12-15mm,hwpoison: refactor get_any_pageOscar Salvador1-57/+42
Patch series "HWPoison: Refactor get page interface", v2. This patch (of 3): When we want to grab a refcount via get_any_page, we call __get_any_page that calls get_hwpoison_page to get the actual refcount. get_any_page() is only there because we have a sort of retry mechanism in case the page we met is unknown to us or if we raced with an allocation. Also __get_any_page() prints some messages about the page type in case the page was a free page or the page type was unknown, but if anything, we only need to print a message in case the page type was unknown, as that is reporting an error down the chain. Let us merge get_any_page() and __get_any_page(), and let the message be printed in soft_offline_page. While we are at it, we can also remove the 'pfn' parameter as it is no longer used. Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Oscar Salvador <[email protected]> Acked-by: Naoya Horiguchi <[email protected]> Acked-by: Vlastimil Babka <[email protected]> Cc: Qian Cai <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2020-12-15mm,hwpoison: drop unneeded pcplist drainingOscar Salvador1-5/+0
memory_failure and soft_offline_page paths now drain pcplists by calling get_hwpoison_page. memory_failure flags the page as HWPoison before, so that page can no longer go into a pcplist, and soft_offline_page only flags a page as HWPoison if 1) we took the page off a buddy freelist 2) the page was in-use and we migrated it 3) it was a clean pagecache page. Because of that, a page can no longer be poisoned and be in a pcplist. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Oscar Salvador <[email protected]> Acked-by: Naoya Horiguchi <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2020-12-15mm,hwpoison: take free pages off the buddy freelistsOscar Salvador1-16/+30
The crux of the matter is that historically we left poisoned pages in the buddy system because we have some checks in place when allocating a page that act as gatekeepers for poisoned pages. Unfortunately, we do have other users (e.g: compaction [1]) that scan buddy freelists and try to get a page from there without checking whether the page is HWPoison. As I stated already, I think it is fundamentally wrong to keep HWPoison pages within the buddy systems, checks in place or not. Let us fix this the same way we did for soft_offline [2], taking the page off the buddy freelist so it is completely unreachable. Note that this is fairly simple to trigger, as we only need to poison free buddy pages (madvise MADV_HWPOISON) and then run some sort of memory stress system. Just for reference, I put a dump_page() in compaction_alloc() to trigger for HWPoison pages: page:0000000012b2982b refcount:1 mapcount:0 mapping:0000000000000000 index:0x1 pfn:0x1d5db flags: 0xfffffc0800000(hwpoison) raw: 000fffffc0800000 ffffea00007573c8 ffffc90000857de0 0000000000000000 raw: 0000000000000001 0000000000000000 00000001ffffffff 0000000000000000 page dumped because: compaction_alloc CPU: 4 PID: 123 Comm: kcompactd0 Tainted: G E 5.9.0-rc2-mm1-1-default+ #5 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.10.2-0-g5f4c7b1-prebuilt.qemu-project.org 04/01/2014 Call Trace: dump_stack+0x6d/0x8b compaction_alloc+0xb2/0xc0 migrate_pages+0x2a6/0x12a0 compact_zone+0x5eb/0x11c0 proactive_compact_node+0x89/0xf0 kcompactd+0x2d0/0x3a0 kthread+0x118/0x130 ret_from_fork+0x22/0x30 After that, if e.g: a process faults in the page, it will get killed unexpectedly. Fix it by containing the page immediately. Besides that, two more changes can be noticed: * MF_DELAYED no longer suits as we are fixing the issue by containing the page immediately, so it does no longer rely on the allocation-time checks to stop HWPoison to be handed over. 
gain unless it is unpoisoned, so we fixed the situation. Because of that, let us use MF_RECOVERED from now on. * The second block that handles PageBuddy pages is no longer needed: We call shake_page and then check whether the page is Buddy because shake_page calls drain_all_pages, which sends pcp-pages back to the buddy freelists, so we could have a chance to handle free pages. Currently, get_hwpoison_page already calls drain_all_pages, and we call get_hwpoison_page right before coming here, so we should be on the safe side. [1] https://lore.kernel.org/linux-mm/20190826104144.GA7849@linux/T/#u [2] https://patchwork.kernel.org/cover/11792607/ [[email protected]: take the poisoned subpage off the buddy freelists] Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Oscar Salvador <[email protected]> Acked-by: Naoya Horiguchi <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2020-12-15mm,hwpoison: drain pcplists before bailing out for non-buddy zero-refcount pageOscar Salvador1-2/+22
Patch series "HWpoison: further fixes and cleanups", v5. This patchset includes some more fixes and a cleanup. Patch#2 and patch#3 are both fixes for taking a HWpoison page off a buddy freelist, since having them there has proved to be bad (see [1] and patch#2's commit log). Patch#3 does the same for hugetlb pages. [1] https://lkml.org/lkml/2020/9/22/565 This patch (of 4): A page with 0-refcount and !PageBuddy could perfectly well be a pcppage. Currently, we bail out with an error if we encounter such a page, meaning that we do not handle pcppages from either the hard-offline or the soft-offline path. Fix this by draining pcplists whenever we find this kind of page and retrying the check. It might be that pcplists have been spilled into the buddy allocator and so we can handle it. Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Oscar Salvador <[email protected]> Acked-by: Naoya Horiguchi <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
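The drain-and-retry idea in the message above can be modelled with a toy userspace sketch. Everything here is hypothetical (the struct, its fields, and both functions); it only mirrors the control flow of "check, drain once, check again" and makes no claim about the actual kernel implementation.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of a page: hypothetical fields, not struct page. */
struct sketch_page {
	int refcount;
	bool on_buddy;		/* sits on a buddy freelist */
	bool on_pcplist;	/* sits on a per-cpu list */
};

/* Model of draining: pages on a per-cpu list move to the buddy freelist. */
static void sketch_drain_pcplists(struct sketch_page *p)
{
	if (p->on_pcplist) {
		p->on_pcplist = false;
		p->on_buddy = true;
	}
}

/* Returns 0 if the page can be handled as a free buddy page, -1 otherwise. */
static int sketch_check_free_page(struct sketch_page *p)
{
	bool drained = false;

retry:
	if (p->refcount == 0 && p->on_buddy)
		return 0;
	if (p->refcount == 0 && !drained) {
		/* Might be a pcp page: drain once and look again. */
		sketch_drain_pcplists(p);
		drained = true;
		goto retry;
	}
	return -1;
}
```

In this model a zero-refcount page on a per-cpu list is handled after one drain, while a zero-refcount page that is on neither list still fails after the single retry.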