Age | Commit message | Author | Files | Lines
2017-09-06userfaultfd: provide pid in userfault msgAlexey Perevalov2-5/+13
It could be useful for calculating per-vCPU downtime during postcopy live migration. A side observer or the application itself will be informed which task slept during userfaultfd processing. The faulting process's thread id is provided when the user requests it by setting the UFFD_FEATURE_THREAD_ID bit in uffdio_api.features. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Alexey Perevalov <[email protected]> Signed-off-by: Andrea Arcangeli <[email protected]> Cc: "Dr. David Alan Gilbert" <[email protected]> Cc: Maxime Coquelin <[email protected]> Cc: Mike Kravetz <[email protected]> Cc: Mike Rapoport <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
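As a rough userspace sketch (not part of the patch itself), a monitor could request the feature and read the thread id like this; the uffd_msg field layout follows the uapi userfaultfd header, so treat the exact names as assumptions:

```c
/* Sketch: request UFFD_FEATURE_THREAD_ID and read the faulting tid. */
#include <fcntl.h>
#include <linux/userfaultfd.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <unistd.h>

int main(void)
{
	int uffd = syscall(__NR_userfaultfd, O_CLOEXEC | O_NONBLOCK);
	struct uffdio_api api = {
		.api = UFFD_API,
		.features = UFFD_FEATURE_THREAD_ID,	/* ask for ptid in msgs */
	};
	if (uffd < 0 || ioctl(uffd, UFFDIO_API, &api) < 0)
		return 1;
	/* ... UFFDIO_REGISTER a range, then in the fault-handling loop: */
	struct uffd_msg msg;
	if (read(uffd, &msg, sizeof(msg)) == sizeof(msg) &&
	    msg.event == UFFD_EVENT_PAGEFAULT)
		printf("fault at %llx from tid %u\n",
		       (unsigned long long)msg.arg.pagefault.address,
		       msg.arg.pagefault.feat.ptid);
	return 0;
}
```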
2017-09-06userfaultfd: call userfaultfd_unmap_prep only if __split_vma succeedsAndrea Arcangeli1-7/+15
A __split_vma is not a worthy event to report, and it's definitely not an unmap, so it would be incorrect to report unmap for the whole region to the userfaultfd manager if a __split_vma fails. So only call userfaultfd_unmap_prep after the vma splitting is over and do_munmap cannot fail anymore. Also add unlikely because it's better to optimize for the vast majority of apps that aren't using userfaultfd in a non-cooperative way. Ideally we should also find a way to eliminate the branch entirely if CONFIG_USERFAULTFD=n, but it would complicate things, so stick to unlikely for now. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Andrea Arcangeli <[email protected]> Cc: "Dr. David Alan Gilbert" <[email protected]> Cc: Alexey Perevalov <[email protected]> Cc: Maxime Coquelin <[email protected]> Cc: Mike Kravetz <[email protected]> Cc: Mike Rapoport <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2017-09-06userfaultfd: selftest: explicit failure if the SIGBUS test failedAndrea Arcangeli1-1/+3
Showing zero in the output isn't very self-explanatory as a successful result. Show a more explicit error output if the test fails. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Andrea Arcangeli <[email protected]> Cc: "Dr. David Alan Gilbert" <[email protected]> Cc: Alexey Perevalov <[email protected]> Cc: Maxime Coquelin <[email protected]> Cc: Mike Kravetz <[email protected]> Cc: Mike Rapoport <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2017-09-06userfaultfd: selftest: exercise UFFDIO_COPY/ZEROPAGE -EEXISTAndrea Arcangeli1-8/+140
This will retry the UFFDIO_COPY/ZEROPAGE to verify it returns -EEXIST at the first invocation and then later every 10 seconds. In the file-backed MAP_SHARED case this also verifies the -EEXIST triggered in the filesystem pagecache insertion, if the offset in the file was not a hole. shmem MAP_SHARED tries to index the newly allocated pagecache in the radix tree before checking the pagetable, so it doesn't need any assistance to exercise that case. hugetlbfs checks the pmd to be not none before trying to index the hugetlbfs page in the radix tree, so it requires running UFFDIO_COPY on an alias mapping (the alternative would be to use MADV_DONTNEED to only zap the pagetables, but that doesn't work on hugetlbfs). [[email protected]: fix uffdio_zeropage(), per Mike Kravetz] Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Andrea Arcangeli <[email protected]> Cc: "Dr. David Alan Gilbert" <[email protected]> Cc: Alexey Perevalov <[email protected]> Cc: Maxime Coquelin <[email protected]> Cc: Mike Kravetz <[email protected]> Cc: Mike Rapoport <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
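A minimal sketch of the retry idea, assuming the uapi contract that uffdio_copy.copy carries -errno on failure (the selftest's actual logic is more involved):

```c
/* Sketch: retry UFFDIO_COPY at an address that is already populated and
 * check that the kernel reports -EEXIST through uffdio_copy.copy. */
#include <errno.h>
#include <linux/userfaultfd.h>
#include <sys/ioctl.h>

static int copy_expect_eexist(int uffd, unsigned long dst, unsigned long src,
			      unsigned long len)
{
	struct uffdio_copy copy = {
		.dst = dst, .src = src, .len = len, .mode = 0,
	};

	if (ioctl(uffd, UFFDIO_COPY, &copy) < 0 && copy.copy == -EEXIST)
		return 0;	/* the already-populated case exercised here */
	return -1;
}
```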
2017-09-06userfaultfd: selftest: add tests for UFFD_FEATURE_SIGBUS featurePrakash Sangappa1-3/+124
Add tests for UFFD_FEATURE_SIGBUS feature. The tests will verify signal delivery instead of userfault events. Also, test use of UFFDIO_COPY to allocate memory and retry accessing monitored area after signal delivery. Also fix a bug in uffd_poll_thread() where 'uffd' is leaked. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Prakash Sangappa <[email protected]> Cc: Shuah Khan <[email protected]> Cc: Andrea Arcangeli <[email protected]> Cc: Mike Kravetz <[email protected]> Cc: Mike Rapoport <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2017-09-06mm: userfaultfd: add feature to request for a signal deliveryPrakash Sangappa2-1/+12
In some cases, the userfaultfd mechanism should just deliver a SIGBUS signal to the faulting process, instead of the page-fault event. Dealing with the page-fault event using a monitor thread can be an overhead in these cases. For example, applications like databases could use the signaling mechanism for robustness purposes. A database uses hugetlbfs for performance reasons. Files on the hugetlbfs filesystem are created and huge pages allocated using the fallocate() API. Pages are deallocated/freed using fallocate() hole punching support. These files are mmapped and accessed by many processes as shared memory. The database keeps track of which offsets in the hugetlbfs file have pages allocated. Any access to a mapped address over holes in the file, which can occur due to bugs in the application, is considered invalid, and the process is expected to simply receive a SIGBUS. However, currently when a hole in the file is accessed via the mapped address, kernel/mm attempts to automatically allocate a page at page fault time, resulting in implicitly filling the hole in the file. This may not be the desired behavior for applications like the database that want to explicitly manage page allocations of hugetlbfs files. Using the userfaultfd mechanism with this support to get a signal, a database application can prevent pages from being allocated implicitly when processes access mapped addresses over holes in the file. This patch adds the UFFD_FEATURE_SIGBUS feature to the userfaultfd mechanism to request a SIGBUS signal. See the following for the previous discussion about the database requirement leading to this proposal, as suggested by Andrea. http://www.spinics.net/lists/linux-mm/msg129224.html Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Prakash Sangappa <[email protected]> Reviewed-by: Mike Rapoport <[email protected]> Reviewed-by: Andrea Arcangeli <[email protected]> Cc: Mike Kravetz <[email protected]> Cc: Shuah Khan <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
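A hedged sketch of how an application could opt in, assuming a standard userfaultfd setup; names come from the uapi header:

```c
/* Sketch: ask userfaultfd to raise SIGBUS instead of queueing events.
 * Assumes addr/len describe an already-mmapped hugetlbfs region. */
#include <fcntl.h>
#include <linux/userfaultfd.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <unistd.h>

static int arm_sigbus(void *addr, size_t len)
{
	int uffd = syscall(__NR_userfaultfd, O_CLOEXEC);
	struct uffdio_api api = { .api = UFFD_API,
				  .features = UFFD_FEATURE_SIGBUS };
	struct uffdio_register reg = {
		.range = { .start = (unsigned long)addr, .len = len },
		.mode = UFFDIO_REGISTER_MODE_MISSING,
	};

	if (uffd < 0 || ioctl(uffd, UFFDIO_API, &api) < 0 ||
	    ioctl(uffd, UFFDIO_REGISTER, &reg) < 0)
		return -1;
	/* A later access to a hole in the hugetlbfs file now delivers
	 * SIGBUS to the faulting thread instead of a userfault event. */
	return uffd;
}
```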
2017-09-06mm: rename global_page_state to global_zone_page_stateMichal Hocko9-25/+25
global_page_state is error prone as a recent bug report pointed out [1]. It only returns proper values for zone based counters, as the enum it takes suggests. We already have global_node_page_state so let's rename global_page_state to global_zone_page_state to be more explicit here. All existing users seem to be correct: $ git grep "global_page_state(NR_" | sed 's@.*(\(NR_[A-Z_]*\)).*@\1@' | sort | uniq -c 2 NR_BOUNCE 2 NR_FREE_CMA_PAGES 11 NR_FREE_PAGES 1 NR_KERNEL_STACK_KB 1 NR_MLOCK 2 NR_PAGETABLE This patch shouldn't introduce any functional change. [1] http://lkml.kernel.org/r/[email protected] Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Michal Hocko <[email protected]> Signed-off-by: Johannes Weiner <[email protected]> Cc: Tetsuo Handa <[email protected]> Cc: Josef Bacik <[email protected]> Cc: Vladimir Davydov <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2017-09-06mm: shm: use new hugetlb size encoding definitionsMike Kravetz2-19/+29
Use the common definitions from hugetlb_encode.h header file for encoding hugetlb size definitions in shmget system call flags. In addition, move these definitions from the internal (kernel) to user (uapi) header file. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Mike Kravetz <[email protected]> Suggested-by: Matthew Wilcox <[email protected]> Acked-by: Michal Hocko <[email protected]> Cc: Andi Kleen <[email protected]> Cc: Andrea Arcangeli <[email protected]> Cc: Aneesh Kumar K.V <[email protected]> Cc: Anshuman Khandual <[email protected]> Cc: Arnd Bergmann <[email protected]> Cc: Davidlohr Bueso <[email protected]> Cc: Michael Kerrisk <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
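A minimal userspace sketch, assuming the SHM_HUGE_* values now exported in uapi <linux/shm.h>; the #ifndef fallbacks mirror those values for older headers:

```c
/* Sketch: create a SysV segment backed by 2MB huge pages. */
#include <stdio.h>
#include <sys/ipc.h>
#include <sys/shm.h>

#ifndef SHM_HUGETLB
#define SHM_HUGETLB	04000
#endif
#ifndef SHM_HUGE_SHIFT
#define SHM_HUGE_SHIFT	26
#endif
#ifndef SHM_HUGE_2MB
#define SHM_HUGE_2MB	(21 << SHM_HUGE_SHIFT)	/* log2(2MB) == 21 */
#endif

int main(void)
{
	size_t len = 4UL << 20;	/* two 2MB pages */
	int id = shmget(IPC_PRIVATE, len,
			IPC_CREAT | 0600 | SHM_HUGETLB | SHM_HUGE_2MB);

	if (id < 0) {
		perror("shmget");	/* needs hugetlb pages reserved */
		return 1;
	}
	void *p = shmat(id, NULL, 0);
	shmctl(id, IPC_RMID, NULL);	/* destroy once the last user detaches */
	return p == (void *)-1;
}
```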
2017-09-06mm: arch: consolidate mmap hugetlb size encodingsMike Kravetz8-74/+22
A non-default huge page size can be encoded in the flags argument of the mmap system call. The definitions for these encodings are in arch specific header files. However, all architectures use the same values. Consolidate all the definitions in the primary user header file (uapi/linux/mman.h). Include definitions for all known huge page sizes. Use the generic encoding definitions in hugetlb_encode.h as the basis for these definitions. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Mike Kravetz <[email protected]> Acked-by: Michal Hocko <[email protected]> Cc: Andi Kleen <[email protected]> Cc: Andrea Arcangeli <[email protected]> Cc: Aneesh Kumar K.V <[email protected]> Cc: Anshuman Khandual <[email protected]> Cc: Arnd Bergmann <[email protected]> Cc: Davidlohr Bueso <[email protected]> Cc: Matthew Wilcox <[email protected]> Cc: Michael Kerrisk <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2017-09-06mm: hugetlb: define system call hugetlb size encodings in single fileMike Kravetz1-0/+34
Patch series "Consolidate system call hugetlb page size encodings". These patches are the result of discussions in https://lkml.org/lkml/2017/3/8/548. The following changes are made in the patch set: 1) Put all the log2 encoded huge page size definitions in a common header file. The idea is have a set of definitions that can be use as the basis for system call specific definitions such as MAP_HUGE_* and SHM_HUGE_*. 2) Remove MAP_HUGE_* definitions in arch specific files. All these definitions are the same. Consolidate all definitions in the primary user header file (uapi/linux/mman.h). 3) Remove SHM_HUGE_* definitions intended for user space from kernel header file, and add to user (uapi/linux/shm.h) header file. Add definitions for all known huge page size encodings as in mmap. This patch (of 3): If hugetlb pages are requested in mmap or shmget system calls, a huge page size other than default can be requested. This is accomplished by encoding the log2 of the huge page size in the upper bits of the flag argument. asm-generic and arch specific headers all define the same values for these encodings. Put common definitions in a single header file. The primary uapi header files for mmap and shm will use these definitions as a basis for definitions specific to those system calls. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Mike Kravetz <[email protected]> Acked-by: Michal Hocko <[email protected]> Cc: Matthew Wilcox <[email protected]> Cc: Andi Kleen <[email protected]> Cc: Michael Kerrisk <[email protected]> Cc: Davidlohr Bueso <[email protected]> Cc: Anshuman Khandual <[email protected]> Cc: Aneesh Kumar K.V <[email protected]> Cc: Andrea Arcangeli <[email protected]> Cc: Arnd Bergmann <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2017-09-06include/linux/fs.h: remove unneeded forward definition of mm_structJeff Layton1-2/+0
Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Jeff Layton <[email protected]> Reviewed-by: Jan Kara <[email protected]> Cc: Alexander Viro <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2017-09-06fs/sync.c: remove unnecessary NULL f_mapping check in sync_file_rangeJeff Layton1-5/+0
fsync codepath assumes that f_mapping can never be NULL, but sync_file_range has a check for that. Remove the one from sync_file_range as I don't see how you'd ever get a NULL pointer in here. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Jeff Layton <[email protected]> Cc: Alexander Viro <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2017-09-06userfaultfd: selftest: enable testing of UFFDIO_ZEROPAGE for shmemMike Rapoport1-1/+1
Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Mike Rapoport <[email protected]> Cc: "Kirill A. Shutemov" <[email protected]> Cc: Andrea Arcangeli <[email protected]> Cc: Hillf Danton <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Pavel Emelyanov <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2017-09-06userfaultfd: report UFFDIO_ZEROPAGE as available for shmem VMAsMike Rapoport1-5/+5
Now that shmem VMAs can be filled with the zero page via userfaultfd, we can report that UFFDIO_ZEROPAGE is available for those VMAs. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Mike Rapoport <[email protected]> Cc: "Kirill A. Shutemov" <[email protected]> Cc: Andrea Arcangeli <[email protected]> Cc: Hillf Danton <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Pavel Emelyanov <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2017-09-06userfaultfd: shmem: wire up shmem_mfill_zeropage_pteMike Rapoport1-2/+4
For shmem VMAs we can use shmem_mfill_zeropage_pte for UFFDIO_ZEROPAGE Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Mike Rapoport <[email protected]> Cc: "Kirill A. Shutemov" <[email protected]> Cc: Andrea Arcangeli <[email protected]> Cc: Hillf Danton <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Pavel Emelyanov <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2017-09-06userfaultfd: mcopy_atomic: introduce mfill_atomic_pte helperMike Rapoport1-16/+30
Shuffle the code a bit to improve readability. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Mike Rapoport <[email protected]> Cc: "Kirill A. Shutemov" <[email protected]> Cc: Andrea Arcangeli <[email protected]> Cc: Hillf Danton <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Pavel Emelyanov <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2017-09-06userfaultfd: shmem: add shmem_mfill_zeropage_pte for userfaultfd supportMike Rapoport2-17/+51
shmem_mfill_zeropage_pte is the low level routine that implements the userfaultfd UFFDIO_ZEROPAGE command. Since for shmem mappings zero pages are always allocated and accounted, the new method is a slight extension of the existing shmem_mcopy_atomic_pte. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Mike Rapoport <[email protected]> Cc: "Kirill A. Shutemov" <[email protected]> Cc: Andrea Arcangeli <[email protected]> Cc: Hillf Danton <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Pavel Emelyanov <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
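From the userspace side, the call this routine ends up servicing looks roughly like the sketch below (assuming the uffd was already registered over the shmem range); struct and ioctl names follow the uapi header:

```c
/* Sketch: resolve a missing-page fault on a shmem-backed range with the
 * zero page via UFFDIO_ZEROPAGE. */
#include <linux/userfaultfd.h>
#include <sys/ioctl.h>

static int fill_with_zeropage(int uffd, unsigned long addr, unsigned long len)
{
	struct uffdio_zeropage zp = {
		.range = { .start = addr, .len = len },
		.mode = 0,
	};

	if (ioctl(uffd, UFFDIO_ZEROPAGE, &zp) < 0)
		return -1;	/* zp.zeropage holds -errno on failure */
	return 0;
}
```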
2017-09-06shmem: introduce shmem_inode_acct_blockMike Rapoport1-56/+46
The shmem_acct_block and the update of used_blocks follow one another in all the places they are used. Combine these two into a helper function. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Mike Rapoport <[email protected]> Cc: "Kirill A. Shutemov" <[email protected]> Cc: Andrea Arcangeli <[email protected]> Cc: Hillf Danton <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Pavel Emelyanov <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2017-09-06shmem: shmem_charge: verify max_block is not exceeded before inode updateMike Rapoport1-13/+12
Patch series "userfaultfd: enable zeropage support for shmem". These patches enable support for UFFDIO_ZEROPAGE for shared memory. The first two patches are not strictly related to userfaultfd, they are just minor refactoring to reduce amount of code duplication. This patch (of 7): Currently we update inode and shmem_inode_info before verifying that used_blocks will not exceed max_blocks. In case it will, we undo the update. Let's switch the order and move the verification of the blocks count before the inode and shmem_inode_info update. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Mike Rapoport <[email protected]> Cc: Andrea Arcangeli <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: "Kirill A. Shutemov" <[email protected]> Cc: Hillf Danton <[email protected]> Cc: Pavel Emelyanov <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2017-09-06mm, THP, swap: add THP swapping out fallback countingHuang Ying3-0/+5
When swapping out a THP (Transparent Huge Page), instead of swapping out the THP as a whole, sometimes we have to fall back to splitting the THP into normal pages before swapping, because no free swap clusters are available, or the cgroup limit is exceeded, etc. To count the number of fallbacks, a new VM event THP_SWPOUT_FALLBACK is added, and counted when we fall back to splitting the THP. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: "Huang, Ying" <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Minchan Kim <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Shaohua Li <[email protected]> Cc: Rik van Riel <[email protected]> Cc: Andrea Arcangeli <[email protected]> Cc: "Kirill A . Shutemov" <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Dan Williams <[email protected]> Cc: Jens Axboe <[email protected]> Cc: Ross Zwisler <[email protected]> [for brd.c, zram_drv.c, pmem.c] Cc: Vishal L Verma <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
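A small sketch for reading the counter from userspace; the vmstat name "thp_swpout_fallback" is assumed from the event name rather than quoted from the patch:

```c
/* Sketch: scan /proc/vmstat for the THP swap-out fallback counter. */
#include <stdio.h>
#include <string.h>

int main(void)
{
	char key[64];
	unsigned long long val;
	FILE *f = fopen("/proc/vmstat", "r");

	if (!f)
		return 1;
	while (fscanf(f, "%63s %llu", key, &val) == 2)
		if (!strcmp(key, "thp_swpout_fallback"))
			printf("THP swap-out fallbacks: %llu\n", val);
	fclose(f);
	return 0;
}
```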
2017-09-06mm, THP, swap: delay splitting THP after swapped outHuang Ying1-43/+52
In this patch, splitting the transparent huge page (THP) during swapping out is delayed from after adding the THP into the swap cache to after swapping out finishes. After the patch, more operations for the anonymous THP reclaiming, such as writing the THP to the swap device and removing the THP from the swap cache, can be batched, so that the performance of anonymous THP swapping out is improved. This is the second step for the THP swap support. The plan is to delay splitting the THP step by step and avoid splitting the THP finally. With the patchset, the swap out throughput improves 42% (from about 5.81GB/s to about 8.25GB/s) in the vm-scalability swap-w-seq test case with 16 processes. At the same time, the IPI (reflecting TLB flushing) count reduced about 78.9%. The test is done on a Xeon E5 v3 system. The swap device used is a RAM simulated PMEM (persistent memory) device. To test the sequential swapping out, the test case creates 8 processes, which sequentially allocate and write to the anonymous pages until the RAM and part of the swap device is used up. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: "Huang, Ying" <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Minchan Kim <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Shaohua Li <[email protected]> Cc: Rik van Riel <[email protected]> Cc: Andrea Arcangeli <[email protected]> Cc: "Kirill A . Shutemov" <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Dan Williams <[email protected]> Cc: Jens Axboe <[email protected]> Cc: Ross Zwisler <[email protected]> [for brd.c, zram_drv.c, pmem.c] Cc: Vishal L Verma <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2017-09-06memcg, THP, swap: make mem_cgroup_swapout() support THPHuang Ying1-8/+15
This patch makes mem_cgroup_swapout() work for the transparent huge page (THP), moving the memory cgroup charge from memory to swap for a THP. This will be used for the THP swap support, where a THP may be swapped out as a whole to a set of (HPAGE_PMD_NR) continuous swap slots on the swap device. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: "Huang, Ying" <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Minchan Kim <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Andrea Arcangeli <[email protected]> Cc: "Kirill A . Shutemov" <[email protected]> Cc: Dan Williams <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Jens Axboe <[email protected]> Cc: Rik van Riel <[email protected]> Cc: Ross Zwisler <[email protected]> [for brd.c, zram_drv.c, pmem.c] Cc: Shaohua Li <[email protected]> Cc: Vishal L Verma <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2017-09-06memcg, THP, swap: avoid to duplicated charge THP in swap cacheHuang Ying1-1/+1
For a THP (Transparent Huge Page), tail_page->mem_cgroup is NULL. So to check whether the page is charged already, we need to check the head page. This was not an issue before because it was impossible for a THP to be in the swap cache. But after we add support for delaying splitting the THP until after it is swapped out, it is possible now. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: "Huang, Ying" <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Minchan Kim <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Andrea Arcangeli <[email protected]> Cc: "Kirill A . Shutemov" <[email protected]> Cc: Dan Williams <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Jens Axboe <[email protected]> Cc: Rik van Riel <[email protected]> Cc: Ross Zwisler <[email protected]> [for brd.c, zram_drv.c, pmem.c] Cc: Shaohua Li <[email protected]> Cc: Vishal L Verma <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2017-09-06memcg, THP, swap: support move mem cgroup charge for THP swapped outHuang Ying1-2/+5
PTE mapped THP (Transparent Huge Page) will be ignored when moving memory cgroup charge. But for THP which is in the swap cache, the memory cgroup charge for the swap of a tail-page may be moved in current implementation. That isn't correct, because the swap charge for all sub-pages of a THP should be moved together. Following the processing of the PTE mapped THP, the mem cgroup charge moving for the swap entry for a tail-page of a THP is ignored too. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: "Huang, Ying" <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Minchan Kim <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Andrea Arcangeli <[email protected]> Cc: "Kirill A . Shutemov" <[email protected]> Cc: Dan Williams <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Jens Axboe <[email protected]> Cc: Rik van Riel <[email protected]> Cc: Ross Zwisler <[email protected]> [for brd.c, zram_drv.c, pmem.c] Cc: Shaohua Li <[email protected]> Cc: Vishal L Verma <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2017-09-06mm, THP, swap: support splitting THP for THP swap outHuang Ying3-1/+33
After adding swapping out support for THP (Transparent Huge Page), it is possible that a THP in the swap cache (partly swapped out) needs to be split. To split such a THP, the swap cluster backing the THP needs to be split too, that is, the CLUSTER_FLAG_HUGE flag needs to be cleared for the swap cluster. This patch implements that. And because writing the THP to swap requires it to stay a huge page during the write, the PageWriteback flag is checked before splitting. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: "Huang, Ying" <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Minchan Kim <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Shaohua Li <[email protected]> Cc: Rik van Riel <[email protected]> Cc: Andrea Arcangeli <[email protected]> Cc: "Kirill A . Shutemov" <[email protected]> Cc: Dan Williams <[email protected]> Cc: Jens Axboe <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Ross Zwisler <[email protected]> [for brd.c, zram_drv.c, pmem.c] Cc: Vishal L Verma <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2017-09-06mm: test code to write THP to swap device as a wholeHuang Ying5-7/+28
To support delaying splitting the THP (Transparent Huge Page) until after it is swapped out, we need to enhance the swap writing code to support writing a THP as a whole. This will improve swap write IO performance. As Ming Lei <[email protected]> pointed out, this should be based on multipage bvec support, which hasn't been merged yet. So this patch is only for testing the functionality of the other patches in the series. It will be reimplemented after multipage bvec support is merged. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: "Huang, Ying" <[email protected]> Cc: "Kirill A . Shutemov" <[email protected]> Cc: Andrea Arcangeli <[email protected]> Cc: Dan Williams <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Jens Axboe <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Minchan Kim <[email protected]> Cc: Rik van Riel <[email protected]> Cc: Ross Zwisler <[email protected]> [for brd.c, zram_drv.c, pmem.c] Cc: Shaohua Li <[email protected]> Cc: Vishal L Verma <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2017-09-06block, THP: make block_device_operations.rw_page support THPHuang Ying4-13/+40
The .rw_page in struct block_device_operations is used by the swap subsystem to read/write the page contents from/into the corresponding swap slot in the swap device. To support the THP (Transparent Huge Page) swap optimization, .rw_page is enhanced to read/write a THP if possible. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: "Huang, Ying" <[email protected]> Reviewed-by: Ross Zwisler <[email protected]> [for brd.c, zram_drv.c, pmem.c] Cc: Johannes Weiner <[email protected]> Cc: Minchan Kim <[email protected]> Cc: Dan Williams <[email protected]> Cc: Vishal L Verma <[email protected]> Cc: Jens Axboe <[email protected]> Cc: "Kirill A . Shutemov" <[email protected]> Cc: Andrea Arcangeli <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Rik van Riel <[email protected]> Cc: Shaohua Li <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2017-09-06mm, THP, swap: don't allocate huge cluster for file backed swap deviceHuang Ying1-3/+4
It's hard to write a whole transparent huge page (THP) to a file backed swap device during swapping out and the file backed swap device isn't very popular. So the huge cluster allocation for the file backed swap device is disabled. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: "Huang, Ying" <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Minchan Kim <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Shaohua Li <[email protected]> Cc: Rik van Riel <[email protected]> Cc: "Kirill A . Shutemov" <[email protected]> Cc: Andrea Arcangeli <[email protected]> Cc: Dan Williams <[email protected]> Cc: Jens Axboe <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Ross Zwisler <[email protected]> [for brd.c, zram_drv.c, pmem.c] Cc: Vishal L Verma <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2017-09-06mm, THP, swap: make reuse_swap_page() works for THP swapped outHuang Ying4-15/+113
After adding support to delay THP (Transparent Huge Page) splitting until after it is swapped out, it is possible that some page table mappings of the THP are turned into swap entries. So reuse_swap_page() needs to check the swap count in addition to the map count as before. This patch does that. In the huge PMD write protect fault handler, in addition to the page map count, the swap count needs to be checked too, so the page lock needs to be acquired as well when calling reuse_swap_page(), in addition to the page table lock. [[email protected]: silence a compiler warning] Link: http://lkml.kernel.org/r/[email protected] Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: "Huang, Ying" <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Minchan Kim <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Shaohua Li <[email protected]> Cc: Rik van Riel <[email protected]> Cc: Andrea Arcangeli <[email protected]> Cc: "Kirill A . Shutemov" <[email protected]> Cc: Dan Williams <[email protected]> Cc: Jens Axboe <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Ross Zwisler <[email protected]> [for brd.c, zram_drv.c, pmem.c] Cc: Vishal L Verma <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2017-09-06mm, THP, swap: support to reclaim swap space for THP swapped outHuang Ying2-7/+72
The normal swap slot reclaiming can be done when the swap count reaches SWAP_HAS_CACHE. But for a swap slot which is backing a THP, all swap slots backing one THP must be reclaimed together, because a swap slot may be used again when the THP is swapped out again later. So the swap slots backing one THP can be reclaimed together when the swap count for all swap slots of the THP reaches SWAP_HAS_CACHE. In the patch, the functions to check whether the swap count for all swap slots backing one THP has reached SWAP_HAS_CACHE are implemented and used when checking whether a swap slot can be reclaimed. To make it easier to determine whether a swap slot is backing a THP, a new swap cluster flag named CLUSTER_FLAG_HUGE is added to mark a swap cluster which is backing a THP (Transparent Huge Page). Because swapping in a THP as a whole isn't supported yet, after deleting the THP from the swap cache (for example, when swapping out has finished), the CLUSTER_FLAG_HUGE flag will be cleared, so that the normal pages inside the THP can be swapped in individually. [[email protected]: fix swap_page_trans_huge_swapped on HDD] Link: http://lkml.kernel.org/r/[email protected] Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: "Huang, Ying" <[email protected]> Acked-by: Rik van Riel <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Minchan Kim <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Shaohua Li <[email protected]> Cc: "Kirill A . Shutemov" <[email protected]> Cc: Andrea Arcangeli <[email protected]> Cc: Dan Williams <[email protected]> Cc: Jens Axboe <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Ross Zwisler <[email protected]> [for brd.c, zram_drv.c, pmem.c] Cc: Vishal L Verma <[email protected]> Cc: Naoya Horiguchi <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2017-09-06mm, THP, swap: support to clear swap cache flag for THP swapped outHuang Ying1-7/+25
Patch series "mm, THP, swap: Delay splitting THP after swapped out", v3. This is the second step of THP (Transparent Huge Page) swap optimization. In the first step, the splitting huge page is delayed from almost the first step of swapping out to after allocating the swap space for the THP and adding the THP into the swap cache. In the second step, the splitting is delayed further to after the swapping out finished. The plan is to delay splitting THP step by step, finally avoid splitting THP for the THP swapping out and swap out/in the THP as a whole. In the patchset, more operations for the anonymous THP reclaiming, such as TLB flushing, writing the THP to the swap device, removing the THP from the swap cache are batched. So that the performance of anonymous THP swapping out are improved. During the development, the following scenarios/code paths have been checked, - swap out/in - swap off - write protect page fault - madvise_free - process exit - split huge page With the patchset, the swap out throughput improves 42% (from about 5.81GB/s to about 8.25GB/s) in the vm-scalability swap-w-seq test case with 16 processes. At the same time, the IPI (reflect TLB flushing) reduced about 78.9%. The test is done on a Xeon E5 v3 system. The swap device used is a RAM simulated PMEM (persistent memory) device. To test the sequential swapping out, the test case creates 8 processes, which sequentially allocate and write to the anonymous pages until the RAM and part of the swap device is used up. Below is the part of the cover letter for the first step patchset of THP swap optimization which applies to all steps. ========================= Recently, the performance of the storage devices improved so fast that we cannot saturate the disk bandwidth with single logical CPU when do page swap out even on a high-end server machine. Because the performance of the storage device improved faster than that of single logical CPU. And it seems that the trend will not change in the near future. On the other hand, the THP becomes more and more popular because of increased memory size. So it becomes necessary to optimize THP swap performance. The advantages of the THP swap support include: - Batch the swap operations for the THP to reduce TLB flushing and lock acquiring/releasing, including allocating/freeing the swap space, adding/deleting to/from the swap cache, and writing/reading the swap space, etc. This will help improve the performance of the THP swap. - The THP swap space read/write will be 2M sequential IO. It is particularly helpful for the swap read, which are usually 4k random IO. This will improve the performance of the THP swap too. - It will help the memory fragmentation, especially when the THP is heavily used by the applications. The 2M continuous pages will be free up after THP swapping out. - It will improve the THP utilization on the system with the swap turned on. Because the speed for khugepaged to collapse the normal pages into the THP is quite slow. After the THP is split during the swapping out, it will take quite long time for the normal pages to collapse back into the THP after being swapped in. The high THP utilization helps the efficiency of the page based memory management too. There are some concerns regarding THP swap in, mainly because possible enlarged read/write IO size (for swap in/out) may put more overhead on the storage device. To deal with that, the THP swap in should be turned on only when necessary. 
For example, it can be selected via "always/never/madvise" logic, to be turned on globally, turned off globally, or turned on only for VMAs with MADV_HUGEPAGE, etc. This patch (of 12): Previously, swapcache_free_cluster() was used only in the error path of shrink_page_list() to free the swap cluster just allocated if the THP (Transparent Huge Page) failed to be split. In this patch, it is enhanced to clear the swap cache flag (SWAP_HAS_CACHE) for the swap cluster that holds the contents of a THP swapped out. This will be used by the support for delaying splitting the THP after swapping out. Because there is no support for swapping in a THP as a whole yet, after clearing the swap cache flag, the swap cluster backing the THP swapped out will be split, so that the swap slots in the swap cluster can be swapped in as normal pages later. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: "Huang, Ying" <[email protected]> Acked-by: Rik van Riel <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Minchan Kim <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Shaohua Li <[email protected]> Cc: "Kirill A . Shutemov" <[email protected]> Cc: Andrea Arcangeli <[email protected]> Cc: Dan Williams <[email protected]> Cc: Jens Axboe <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Ross Zwisler <[email protected]> [for brd.c, zram_drv.c, pmem.c] Cc: Vishal L Verma <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2017-09-06mm: memcontrol: use int for event/state parameter in several functionsMatthias Kaehlcke2-21/+35
Several functions use an enum type as parameter for an event/state, but are called in some locations with an argument of a different enum type. Adjust the interface of these functions to reality by changing the parameter to int. This fixes a ton of enum-conversion warnings that are generated when building the kernel with clang. [[email protected]: also change parameter type of inc/dec/mod_memcg_page_state()] Link: http://lkml.kernel.org/r/[email protected] Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Matthias Kaehlcke <[email protected]> Acked-by: Johannes Weiner <[email protected]> Acked-by: Michal Hocko <[email protected]> Cc: Vladimir Davydov <[email protected]> Cc: Doug Anderson <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
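A minimal illustration of the warning pattern being fixed, not the actual memcontrol code:

```c
/* Sketch (illustrative): passing one enum where a parameter of a
 * different enum type is expected trips clang's -Wenum-conversion.
 * Widening the parameter to int accepts either counter namespace. */
enum node_stat { NODE_FILE_PAGES, NODE_ANON_PAGES };
enum memcg_stat { MEMCG_SOCK, MEMCG_KMEM };

/* before: static void mod_state(enum node_stat idx, int val); */
static void mod_state(int idx, int val)	/* after: plain int */
{
	/* idx may name either a node counter or a memcg-only counter */
	(void)idx;
	(void)val;
}

static void example(void)
{
	mod_state(NODE_FILE_PAGES, 1);	/* fine before and after */
	mod_state(MEMCG_SOCK, 1);	/* warned before, clean now */
}
```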
2017-09-06mm/hugetlb.c: constify attribute_group structuresArvind Yadav1-3/+3
attribute_group are not supposed to change at runtime. All functions working with attribute_group provided by <linux/sysfs.h> work with const attribute_group. So mark the non-const structs as const. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Arvind Yadav <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2017-09-06mm/huge_memory.c: constify attribute_group structuresArvind Yadav1-1/+1
attribute_group are not supposed to change at runtime. All functions working with attribute_group provided by <linux/sysfs.h> work with const attribute_group. So mark the non-const structs as const. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Arvind Yadav <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2017-09-06mm/page_idle.c: constify attribute_group structuresArvind Yadav1-1/+1
attribute_group are not supposed to change at runtime. All functions working with attribute_group provided by <linux/sysfs.h> work with const attribute_group. So mark the non-const structs as const. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Arvind Yadav <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2017-09-06mm/slub.c: constify attribute_group structuresArvind Yadav1-1/+1
attribute_group are not supposed to change at runtime. All functions working with attribute_group provided by <linux/sysfs.h> work with const attribute_group. So mark the non-const structs as const. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Arvind Yadav <[email protected]> Acked-by: Christoph Lameter <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2017-09-06mm/ksm.c: constify attribute_group structuresArvind Yadav1-1/+1
attribute_group are not supposed to change at runtime. All functions working with attribute_group provided by <linux/sysfs.h> work with const attribute_group. So mark the non-const structs as const. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Arvind Yadav <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2017-09-06cgroup: revert fa06235b8eb0 ("cgroup: reset css on destruction")Roman Gushchin1-3/+0
Commit fa06235b8eb0 ("cgroup: reset css on destruction") caused the css_reset callback to be called from the offlining path. Although it solves the problem mentioned in the commit description ("For instance, memory cgroup needs to reset memory.low, otherwise pages charged to a dead cgroup might never get reclaimed."), generally speaking, it's not correct. An offline cgroup can still be a resource domain, and we shouldn't grant it more resources than it had before deletion. For instance, if an offline memory cgroup has dirty pages, we should still apply i/o limits during writeback. The css_reset callback is designed to return the cgroup to its original state, that means reset all limits and counters. It's something different from offlining, and we shouldn't use it from the offlining path. Instead, we should adjust necessary settings from the per-controller css_offline callbacks (e.g. reset memory.low). Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Roman Gushchin <[email protected]> Acked-by: Tejun Heo <[email protected]> Acked-by: Johannes Weiner <[email protected]> Cc: Vladimir Davydov <[email protected]> Cc: Michal Hocko <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2017-09-06mm, memcg: reset memory.low during memcg offliningRoman Gushchin1-0/+2
A removed memory cgroup with a defined memory.low and some belonging pagecache has a very low chance of being freed. If a cgroup has been removed, there is likely no memory pressure inside the cgroup, and the pagecache is protected from the external pressure by the defined low limit. The cgroup will be freed only after the reclaim of all belonging pages. And that will not happen as long as there is other reclaimable memory in the system. That means there is a good chance that cold pagecache will reside in memory for an undefined amount of time, wasting system resources. This problem was fixed earlier by fa06235b8eb0 ("cgroup: reset css on destruction"), but it's not the best way to do it, as we can't really reset all limits/counters during cgroup offlining. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Roman Gushchin <[email protected]> Acked-by: Johannes Weiner <[email protected]> Acked-by: Michal Hocko <[email protected]> Cc: Vladimir Davydov <[email protected]> Cc: Tejun Heo <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2017-09-06mm: remove nr_pages argument from pagevec_lookup{,_range}()Jan Kara8-18/+13
All users of pagevec_lookup() and pagevec_lookup_range() now pass PAGEVEC_SIZE as a desired number of pages. Just drop the argument. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Jan Kara <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2017-09-06mm: use find_get_pages_range() in filemap_range_has_page()Jan Kara1-7/+4
We want only pages from given range in filemap_range_has_page(), furthermore we want at most a single page. So use find_get_pages_range() instead of pagevec_lookup() and remove unnecessary code. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Jan Kara <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2017-09-06fs: use pagevec_lookup_range() in page_cache_seek_hole_data()Jan Kara1-13/+3
We want only pages from given range in page_cache_seek_hole_data(). Use pagevec_lookup_range() instead of pagevec_lookup() and remove unnecessary code. Note that the check for getting less pages than desired can be removed because index gets updated by pagevec_lookup_range(). Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Jan Kara <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2017-09-06hugetlbfs: use pagevec_lookup_range() in remove_inode_hugepages()Jan Kara1-16/+2
We want only pages from given range in remove_inode_hugepages(). Use pagevec_lookup_range() instead of pagevec_lookup(). Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Jan Kara <[email protected]> Reviewed-by: Mike Kravetz <[email protected]> Cc: Nadia Yvette Chambers <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2017-09-06ext4: use pagevec_lookup_range() in writeback codeJan Kara1-7/+5
Both occurrences of pagevec_lookup() actually want only pages from a given range. Use pagevec_lookup_range() for the lookup. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Jan Kara <[email protected]> Cc: "Theodore Ts'o" <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2017-09-06ext4: use pagevec_lookup_range() in ext4_find_unwritten_pgoff()Jan Kara1-10/+4
Use pagevec_lookup_range() in ext4_find_unwritten_pgoff() since we are interested only in pages in the given range. Simplify the logic as a result of not getting pages out of range and index getting automatically advanced. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Jan Kara <[email protected]> Cc: "Theodore Ts'o" <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2017-09-06fs: fix performance regression in clean_bdev_aliases()Jan Kara1-6/+8
Commit e64855c6cfaa ("fs: Add helper to clean bdev aliases under a bh and use it") added a wrapper for clean_bdev_aliases() that invalidates bdev aliases underlying a single buffer head. However this has caused a performance regression for the bonnie++ benchmark on an ext4 filesystem when delayed allocation is turned off (ext3 mode) - average of 3 runs: Hmean SeqOut Char 164787.55 ( 0.00%) 107189.06 (-34.95%) Hmean SeqOut Block 219883.89 ( 0.00%) 168870.32 (-23.20%) The reason for this regression is that clean_bdev_aliases() is slower when called for a single block because the pagevec_lookup() it uses will end up iterating through the radix tree until it finds a page (which may take a while) but we are only interested in whether there's a page at a particular index. Fix the problem by using pagevec_lookup_range() instead which avoids the needless iteration. Fixes: e64855c6cfaa ("fs: Add helper to clean bdev aliases under a bh and use it") Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Jan Kara <[email protected]> Cc: Jens Axboe <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2017-09-06mm: implement find_get_pages_range()Jan Kara4-24/+65
Implement a variant of find_get_pages() that stops iterating at a given index. This may be a substantial performance gain if the mapping is sparse. See the following commit for details. Furthermore, lots of users of this function (through pagevec_lookup()) actually want a range lookup and all of them are currently open-coding this. Also create a corresponding pagevec_lookup_range() function. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Jan Kara <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
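A kernel-style sketch of the ranged lookup loop callers are converted to; the exact signature at this point in the series (still taking nr_pages) is an assumption:

```c
/* Sketch: iterate only pages within [index, end], letting
 * pagevec_lookup_range() advance `index` past the pages it returned. */
struct pagevec pvec;
pgoff_t index = start;
unsigned i;

pagevec_init(&pvec, 0);
while (pagevec_lookup_range(&pvec, mapping, &index, end, PAGEVEC_SIZE)) {
	for (i = 0; i < pagevec_count(&pvec); i++) {
		struct page *page = pvec.pages[i];
		/* ... process page; it is guaranteed to lie within the range ... */
	}
	pagevec_release(&pvec);
	cond_resched();
}
```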
2017-09-06mm: make pagevec_lookup() update indexJan Kara11-37/+32
Make pagevec_lookup() (and underlying find_get_pages()) update index to the next page where iteration should continue. Most callers want this and also pagevec_lookup_tag() already does this. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Jan Kara <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2017-09-06fscache: remove unused ->now_uncached callbackJan Kara7-185/+0
Patch series "Ranged pagevec lookup", v2. In this series I make pagevec_lookup() update the index (to be consistent with pagevec_lookup_tag() and also as a preparation for ranged lookups), provide ranged variant of pagevec_lookup() and use it in places where it makes sense. This not only removes some common code but is also a measurable performance win for some use cases (see patch 4/10) where radix tree is sparse and searching & grabing of a page after the end of the range has measurable overhead. This patch (of 10): The callback doesn't ever get called. Remove it. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Jan Kara <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2017-09-06mm, vmscan: do not loop on too_many_isolated for everMichal Hocko1-1/+7
Tetsuo Handa has reported[1][2][3] that direct reclaimers might get stuck in the too_many_isolated loop basically forever because the last few pages on the LRU lists are isolated by the kswapd which is stuck on fs locks when doing the pageout or slab reclaim. This in turn means that there is nobody to actually trigger the oom killer and the system is basically unusable. too_many_isolated has been introduced by commit 35cd78156c49 ("vmscan: throttle direct reclaim when too many pages are isolated already") to prevent premature oom killer invocations because back then no reclaim progress could indeed trigger the OOM killer too early. But since the oom detection rework in commit 0a0337e0d1d1 ("mm, oom: rework oom detection") the allocation/reclaim retry loop considers all the reclaimable pages and throttles the allocation at that layer so we can loosen the direct reclaim throttling. Make the shrink_inactive_list loop over too_many_isolated bounded and return immediately when the situation hasn't resolved after the first sleep. Replace congestion_wait by a simple schedule_timeout_interruptible because we are not really waiting on the IO congestion in this path. Please note that this patch can theoretically cause the OOM killer to trigger earlier while there are many pages isolated for the reclaim which makes progress only very slowly. This would be obvious from the oom report as the number of isolated pages is printed there. If we ever hit this, should_reclaim_retry should consider those numbers in the evaluation in one way or another. [1] http://lkml.kernel.org/r/[email protected] [2] http://lkml.kernel.org/r/[email protected] [3] http://lkml.kernel.org/r/[email protected] [[email protected]: switch to uninterruptible sleep] Link: http://lkml.kernel.org/r/[email protected] Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Michal Hocko <[email protected]> Reported-by: Tetsuo Handa <[email protected]> Tested-by: Tetsuo Handa <[email protected]> Acked-by: Mel Gorman <[email protected]> Acked-by: Vlastimil Babka <[email protected]> Acked-by: Rik van Riel <[email protected]> Acked-by: Johannes Weiner <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
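A hedged sketch of the bounded-wait pattern described above, not the literal vmscan.c diff:

```c
/* Sketch: sits at the top of a shrink_inactive_list()-like function.
 * Sleep once; if isolation pressure persists, bail out instead of
 * looping forever. */
bool stalled = false;

while (unlikely(too_many_isolated(pgdat, file, sc))) {
	if (stalled)
		return 0;	/* second time around: stop retrying */

	stalled = true;
	/* plain timeout sleep; we are not waiting on IO congestion here */
	schedule_timeout_uninterruptible(HZ / 10);

	/* about to die and free our memory anyway, pretend progress */
	if (fatal_signal_pending(current))
		return SWAP_CLUSTER_MAX;
}
```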