path: root/mm
Age  Commit message  Author  Files  Lines
2017-07-08  Merge branch 'for-linus' of git://git.armlinux.org.uk/~rmk/linux-arm  Linus Torvalds  1-0/+4
Pull ARM updates from Russell King: - add support for ftrace-with-registers, which is needed for kgraft and other ftrace tools - support for mremap() for the sigpage/vDSO so that checkpoint/restore can work - add timestamps to each line of the register dump output - remove the unused KTHREAD_SIZE from nommu - align the ARM bitops APIs with the generic API (using unsigned long pointers rather than void pointers) - make the configuration of userspace Thumb support an expert option so that we can default it on, and avoid some hard to debug userspace crashes * 'for-linus' of git://git.armlinux.org.uk/~rmk/linux-arm: ARM: 8684/1: NOMMU: Remove unused KTHREAD_SIZE definition ARM: 8683/1: ARM32: Support mremap() for sigpage/vDSO ARM: 8679/1: bitops: Align prototypes to generic API ARM: 8678/1: ftrace: Adds support for CONFIG_DYNAMIC_FTRACE_WITH_REGS ARM: make configuration of userspace Thumb support an expert option ARM: 8673/1: Fix __show_regs output timestamps
2017-07-07  Merge tag 'for-linus-v4.13-2' of ↵  Linus Torvalds  2-18/+110
git://git.kernel.org/pub/scm/linux/kernel/git/jlayton/linux Pull Writeback error handling updates from Jeff Layton: "This pile represents the bulk of the writeback error handling fixes that I have for this cycle. Some of the earlier patches in this pile may look trivial but they are prerequisites for later patches in the series. The aim of this set is to improve how we track and report writeback errors to userland. Most applications that care about data integrity will periodically call fsync/fdatasync/msync to ensure that their writes have made it to the backing store. For a very long time, we have tracked writeback errors using two flags in the address_space: AS_EIO and AS_ENOSPC. Those flags are set when a writeback error occurs (via mapping_set_error) and are cleared as a side-effect of filemap_check_errors (as you noted yesterday). This model really sucks for userland. Only the first task to call fsync (or msync or fdatasync) will see the error. Any subsequent task calling fsync on a file will get back 0 (unless another writeback error occurs in the interim). If I have several tasks writing to a file and calling fsync to ensure that their writes got stored, then I need to have them coordinate with one another. That's difficult enough, but in a world of containerized setups that coordination may even not be possible. But wait...it gets worse! The calls to filemap_check_errors can be buried pretty far down in the call stack, and there are internal callers of filemap_write_and_wait and the like that also end up clearing those errors. Many of those callers ignore the error return from that function or return it to userland at nonsensical times (e.g. truncate() or stat()). If I get back -EIO on a truncate, there is no reason to think that it was because some previous writeback failed, and a subsequent fsync() will (incorrectly) return 0. This pile aims to do three things: 1) ensure that when a writeback error occurs that that error will be reported to userland on a subsequent fsync/fdatasync/msync call, regardless of what internal callers are doing 2) report writeback errors on all file descriptions that were open at the time that the error occurred. This is a user-visible change, but I think most applications are written to assume this behavior anyway. Those that aren't are unlikely to be hurt by it. 3) document what filesystems should do when there is a writeback error. Today, there is very little consistency between them, and a lot of cargo-cult copying. We need to make it very clear what filesystems should do in this situation. To achieve this, the set adds a new data type (errseq_t) and then builds new writeback error tracking infrastructure around that. Once all of that is in place, we change the filesystems to use the new infrastructure for reporting wb errors to userland. Note that this is just the initial foray into cleaning up this mess. There is a lot of work remaining here: 1) convert the rest of the filesystems in a similar fashion. Once the initial set is in, then I think most other fs' will be fairly simple to convert. Hopefully most of those can in via individual filesystem trees. 2) convert internal waiters on writeback to use errseq_t for detecting errors instead of relying on the AS_* flags. I have some draft patches for this for ext4, but they are not quite ready for prime time yet. This was a discussion topic this year at LSF/MM too. 
If you're interested in the gory details, LWN has some good articles about this: https://lwn.net/Articles/718734/ https://lwn.net/Articles/724307/" * tag 'for-linus-v4.13-2' of git://git.kernel.org/pub/scm/linux/kernel/git/jlayton/linux: btrfs: minimal conversion to errseq_t writeback error reporting on fsync xfs: minimal conversion to errseq_t writeback error reporting ext4: use errseq_t based error handling for reporting data writeback errors fs: convert __generic_file_fsync to use errseq_t based reporting block: convert to errseq_t based writeback error tracking dax: set errors in mapping when writeback fails Documentation: flesh out the section in vfs.txt on storing and reporting writeback errors mm: set both AS_EIO/AS_ENOSPC and errseq_t in mapping_set_error fs: new infrastructure for writeback error handling and reporting lib: add errseq_t type and infrastructure for handling it mm: don't TestClearPageError in __filemap_fdatawait_range mm: clear AS_EIO/AS_ENOSPC when writeback initiation fails jbd2: don't clear and reset errors after waiting on writeback buffer: set errors in mapping at the time that the error occurs fs: check for writeback errors after syncing out buffers in generic_file_fsync buffer: use mapping_set_error instead of setting the flag mm: fix mapping_set_error call in me_pagecache_dirty
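As a rough illustration of how a filesystem ->fsync() consumes this new infrastructure, here is a hedged sketch based on the changelog above (helper names such as file_check_and_advance_wb_err() are taken from the shortlog and should be treated as assumptions, not as the exact code merged here):

/*
 * Hedged sketch: each struct file carries an errseq_t cursor, so any
 * writeback error recorded in the mapping since this descriptor last
 * checked is reported once per file description, instead of only to the
 * first task that happens to call fsync().
 */
static int example_fsync(struct file *file, loff_t start, loff_t end,
                         int datasync)
{
        struct address_space *mapping = file->f_mapping;
        int ret, err;

        /* Start and wait for writeback of the requested range. */
        ret = filemap_write_and_wait_range(mapping, start, end);

        /* ... flush metadata / issue a cache flush as the fs requires ... */

        /* Advance this file's error cursor and pick up any new wb error. */
        err = file_check_and_advance_wb_err(file);
        if (ret == 0)
                ret = err;
        return ret;
}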
2017-07-07  Merge tag 'for-linus-v4.13-1' of ↵  Linus Torvalds  1-10/+9
git://git.kernel.org/pub/scm/linux/kernel/git/jlayton/linux Pull Writeback error handling fixes from Jeff Layton: "The main rationale for all of these changes is to tighten up writeback error reporting to userland. There are many ways now that writeback errors can be lost, such that fsync/fdatasync/msync return 0 when writeback actually failed. This pile contains a small set of cleanups and writeback error handling fixes that I was able to break off from the main pile (#2). Two of the patches in this pile are trivial. The exceptions are the patch to fix up error handling in write_one_page, and the patch to make JFS pay attention to write_one_page errors" * tag 'for-linus-v4.13-1' of git://git.kernel.org/pub/scm/linux/kernel/git/jlayton/linux: fs: remove call_fsync helper function mm: clean up error handling in write_one_page JFS: do not ignore return code from write_one_page() mm: drop "wait" parameter from write_one_page()
2017-07-07  Merge tag 'powerpc-4.13-1' of ↵  Linus Torvalds  1-1/+5
git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux Pull powerpc updates from Michael Ellerman: "Highlights include: - Support for STRICT_KERNEL_RWX on 64-bit server CPUs. - Platform support for FSP2 (476fpe) board - Enable ZONE_DEVICE on 64-bit server CPUs. - Generic & powerpc spin loop primitives to optimise busy waiting - Convert VDSO update function to use new update_vsyscall() interface - Optimisations to hypercall/syscall/context-switch paths - Improvements to the CPU idle code on Power8 and Power9. As well as many other fixes and improvements. Thanks to: Akshay Adiga, Andrew Donnellan, Andrew Jeffery, Anshuman Khandual, Anton Blanchard, Balbir Singh, Benjamin Herrenschmidt, Christophe Leroy, Christophe Lombard, Colin Ian King, Dan Carpenter, Gautham R. Shenoy, Hari Bathini, Ian Munsie, Ivan Mikhaylov, Javier Martinez Canillas, Madhavan Srinivasan, Masahiro Yamada, Matt Brown, Michael Neuling, Michal Suchanek, Murilo Opsfelder Araujo, Naveen N. Rao, Nicholas Piggin, Oliver O'Halloran, Paul Mackerras, Pavel Machek, Russell Currey, Santosh Sivaraj, Stephen Rothwell, Thiago Jung Bauermann, Yang Li" * tag 'powerpc-4.13-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux: (158 commits) powerpc/Kconfig: Enable STRICT_KERNEL_RWX for some configs powerpc/mm/radix: Implement STRICT_RWX/mark_rodata_ro() for Radix powerpc/mm/hash: Implement mark_rodata_ro() for hash powerpc/vmlinux.lds: Align __init_begin to 16M powerpc/lib/code-patching: Use alternate map for patch_instruction() powerpc/xmon: Add patch_instruction() support for xmon powerpc/kprobes/optprobes: Use patch_instruction() powerpc/kprobes: Move kprobes over to patch_instruction() powerpc/mm/radix: Fix execute permissions for interrupt_vectors powerpc/pseries: Fix passing of pp0 in updatepp() and updateboltedpp() powerpc/64s: Blacklist rtas entry/exit from kprobes powerpc/64s: Blacklist functions invoked on a trap powerpc/64s: Un-blacklist system_call() from kprobes powerpc/64s: Move system_call() symbol to just after setting MSR_EE powerpc/64s: Blacklist system_call() and system_call_common() from kprobes powerpc/64s: Convert .L__replay_interrupt_return to a local label powerpc64/elfv1: Only dereference function descriptor for non-text symbols cxl: Export library to support IBM XSL powerpc/dts: Use #include "..." to include local DT powerpc/perf/hv-24x7: Aggregate result elements on POWER9 SMT8 ...
2017-07-06  mm, memory_hotplug: move movable_node to the hotplug proper  Michal Hocko  2-1/+6
movable_node_is_enabled is defined in memblock proper while it is initialized from the memory hotplug proper. This is quite messy and it makes a dependency between the two so move movable_node along with the helper functions to memory_hotplug. To make it more entertaining the kernel parameter is ignored unless CONFIG_HAVE_MEMBLOCK_NODE_MAP=y because we do not have the node information for each memblock otherwise. So let's warn when the option is disabled. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Michal Hocko <[email protected]> Acked-by: Reza Arbab <[email protected]> Acked-by: Vlastimil Babka <[email protected]> Cc: Mel Gorman <[email protected]> Cc: Andrea Arcangeli <[email protected]> Cc: Jerome Glisse <[email protected]> Cc: Yasuaki Ishimatsu <[email protected]> Cc: Xishi Qiu <[email protected]> Cc: Kani Toshimitsu <[email protected]> Cc: Chen Yucong <[email protected]> Cc: Joonsoo Kim <[email protected]> Cc: Andi Kleen <[email protected]> Cc: David Rientjes <[email protected]> Cc: Daniel Kiper <[email protected]> Cc: Igor Mammedov <[email protected]> Cc: Vitaly Kuznetsov <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
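A hedged sketch of the command-line handling after the move (the warning text and exact shape are assumptions based on the changelog, not a copy of the merged code):

/*
 * movable_node only works when memblock keeps per-node information, so
 * warn instead of silently ignoring the parameter when the dependency
 * is not met.
 */
static int __init cmdline_parse_movable_node(char *p)
{
#ifdef CONFIG_HAVE_MEMBLOCK_NODE_MAP
        movable_node_enabled = true;
#else
        pr_warn("movable_node parameter depends on CONFIG_HAVE_MEMBLOCK_NODE_MAP to work properly\n");
#endif
        return 0;
}
early_param("movable_node", cmdline_parse_movable_node);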
2017-07-06  mm, memory_hotplug: drop CONFIG_MOVABLE_NODE  Michal Hocko  4-34/+0
Commit 20b2f52b73fe ("numa: add CONFIG_MOVABLE_NODE for movable-dedicated node") has introduced CONFIG_MOVABLE_NODE without a good explanation on why it is actually useful. It makes a lot of sense to make movable node semantic opt in but we already have that because the feature has to be explicitly enabled on the kernel command line. A config option on top only makes the configuration space larger without a good reason. It also adds an additional ifdefery that pollutes the code. Just drop the config option and make it de-facto always enabled. This shouldn't introduce any change to the semantic. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Michal Hocko <[email protected]> Acked-by: Reza Arbab <[email protected]> Acked-by: Vlastimil Babka <[email protected]> Cc: Mel Gorman <[email protected]> Cc: Andrea Arcangeli <[email protected]> Cc: Jerome Glisse <[email protected]> Cc: Yasuaki Ishimatsu <[email protected]> Cc: Xishi Qiu <[email protected]> Cc: Kani Toshimitsu <[email protected]> Cc: Chen Yucong <[email protected]> Cc: Joonsoo Kim <[email protected]> Cc: Andi Kleen <[email protected]> Cc: David Rientjes <[email protected]> Cc: Daniel Kiper <[email protected]> Cc: Igor Mammedov <[email protected]> Cc: Vitaly Kuznetsov <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2017-07-06  mm, memory_hotplug: drop artificial restriction on online/offline  Michal Hocko  1-58/+0
Patch series "remove CONFIG_MOVABLE_NODE". I am continuing to clean up the memory hotplug code and CONFIG_MOVABLE_NODE seems dubious at best. The following two patches simply removes the flag and make it de-facto always enabled. The current semantic of the config option is twofold 1) it automatically binds hotplugable nodes to have memory in zone_movable by default when movable_node is enabled 2) forbids memory hotplug to online all the memory as movable when !CONFIG_MOVABLE_NODE. The later restriction is quite dubious because there is no clear cut of how much normal memory do we need for a reasonable system operation. A single memory block which is sufficient to allow further movable onlines is far from sufficient (e.g a node with >2GB and memblocks 128MB will fill up this zone with struct pages leaving nothing for other allocations). Removing the config option will not only reduce the configuration space it also removes quite some code. The semantic of the movable_node command line parameter is preserved. The first patch removes the restriction mentioned above and the second one simply removes all the CONFIG_MOVABLE_NODE related stuff. The last patch moves movable_node flag handling to memory_hotplug proper where it belongs. [1] http://lkml.kernel.org/r/[email protected] This patch (of 3): Commit 74d42d8fe146 ("memory_hotplug: ensure every online node has NORMAL memory") has introduced a restriction that every numa node has to have at least some memory in !movable zones before a first movable memory can be onlined if !CONFIG_MOVABLE_NODE. Likewise can_offline_normal checks the amount of normal memory in !movable zones and it disallows to offline memory if there is no normal memory left with a justification that "memory-management acts bad when we have nodes which is online but don't have any normal memory". While it is true that not having _any_ memory for kernel allocations on a NUMA node is far from great and such a node would be quite subotimal because all kernel allocations will have to fallback to another NUMA node but there is no reason to disallow such a configuration in principle. Besides that there is not really a big difference to have one memblock for ZONE_NORMAL available or none. With 128MB size memblocks the system might trash on the kernel allocations requests anyway. It is really hard to draw a line on how much normal memory is really sufficient so we have to rely on administrator to configure system sanely therefore drop the artificial restriction and remove can_offline_normal and can_online_high_movable altogether. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Michal Hocko <[email protected]> Acked-by: Vlastimil Babka <[email protected]> Acked-by: Reza Arbab <[email protected]> Cc: Mel Gorman <[email protected]> Cc: Andrea Arcangeli <[email protected]> Cc: Jerome Glisse <[email protected]> Cc: Yasuaki Ishimatsu <[email protected]> Cc: Xishi Qiu <[email protected]> Cc: Kani Toshimitsu <[email protected]> Cc: Chen Yucong <[email protected]> Cc: Joonsoo Kim <[email protected]> Cc: Andi Kleen <[email protected]> Cc: David Rientjes <[email protected]> Cc: Daniel Kiper <[email protected]> Cc: Igor Mammedov <[email protected]> Cc: Vitaly Kuznetsov <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2017-07-06  mm: memcontrol: account slab stats per lruvec  Johannes Weiner  3-27/+7
Josef's redesign of the balancing between slab caches and the page cache requires slab cache statistics at the lruvec level. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Johannes Weiner <[email protected]> Acked-by: Vladimir Davydov <[email protected]> Cc: Josef Bacik <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Rik van Riel <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2017-07-06  mm: memcontrol: per-lruvec stats infrastructure  Johannes Weiner  4-22/+21
lruvecs are at the intersection of the NUMA node and memcg, which is the scope for most paging activity. Introduce a convenient accounting infrastructure that maintains statistics per node, per memcg, and the lruvec itself. Then convert over accounting sites for statistics that are already tracked in both nodes and memcgs and can be easily switched. [[email protected]: fix crash in the new cgroup stat keeping code] Link: http://lkml.kernel.org/r/[email protected] [[email protected]: don't track uncharged pages at all Link: http://lkml.kernel.org/r/[email protected] [[email protected]: add missing free_percpu()] Link: http://lkml.kernel.org/r/[email protected] [[email protected]: hexagon: fix build error caused by include file order] Link: http://lkml.kernel.org/r/[email protected] Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Johannes Weiner <[email protected]> Signed-off-by: Guenter Roeck <[email protected]> Acked-by: Vladimir Davydov <[email protected]> Cc: Josef Bacik <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Rik van Riel <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
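A hedged sketch of the accounting pattern this infrastructure enables (the function and item names follow the changelog but are assumptions about the final API, not verified signatures):

/*
 * One cgroup-aware call updates the node counter, the memcg counter and
 * the lruvec counter together, replacing separate node and private
 * memcg updates at every accounting site.
 */
static void example_account_dirty(struct page *page)
{
        /* Before: node-level only; memcg needed its own private counter. */
        /* __inc_node_page_state(page, NR_FILE_DIRTY); */

        /* After: per-node, per-memcg and per-lruvec in a single call. */
        __mod_lruvec_page_state(page, NR_FILE_DIRTY, 1);
}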
2017-07-06  mm: memcontrol: use generic mod_memcg_page_state for kmem pages  Johannes Weiner  1-8/+8
The kmem-specific functions do the same thing. Switch and drop. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Johannes Weiner <[email protected]> Acked-by: Vladimir Davydov <[email protected]> Cc: Josef Bacik <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Rik van Riel <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2017-07-06  mm: memcontrol: use the node-native slab memory counters  Johannes Weiner  2-6/+6
Now that the slab counters are moved from the zone to the node level we can drop the private memcg node stats and use the official ones. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Johannes Weiner <[email protected]> Acked-by: Vladimir Davydov <[email protected]> Cc: Josef Bacik <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Rik van Riel <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2017-07-06  mm: vmstat: move slab statistics from zone to node counters  Johannes Weiner  5-13/+12
Patch series "mm: per-lruvec slab stats" Josef is working on a new approach to balancing slab caches and the page cache. For this to work, he needs slab cache statistics on the lruvec level. These patches implement that by adding infrastructure that allows updating and reading generic VM stat items per lruvec, then switches some existing VM accounting sites, including the slab accounting ones, to this new cgroup-aware API. I'll follow up with more patches on this, because there is actually substantial simplification that can be done to the memory controller when we replace private memcg accounting with making the existing VM accounting sites cgroup-aware. But this is enough for Josef to base his slab reclaim work on, so here goes. This patch (of 5): To re-implement slab cache vs. page cache balancing, we'll need the slab counters at the lruvec level, which, ever since lru reclaim was moved from the zone to the node, is the intersection of the node, not the zone, and the memcg. We could retain the per-zone counters for when the page allocator dumps its memory information on failures, and have counters on both levels - which on all but NUMA node 0 is usually redundant. But let's keep it simple for now and just move them. If anybody complains we can restore the per-zone counters. [[email protected]: fix oops] Link: http://lkml.kernel.org/r/[email protected] Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Johannes Weiner <[email protected]> Cc: Josef Bacik <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Vladimir Davydov <[email protected]> Cc: Rik van Riel <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2017-07-06  mm/zswap.c: delete an error message for a failed memory allocation in ↵  Markus Elfring  1-3/+2
zswap_dstmem_prepare() Omit an extra message for a memory allocation failure in this function. This issue was detected by using the Coccinelle software. Link: http://events.linuxfoundation.org/sites/events/files/slides/LCJ16-Refactor_Strings-WSang_0.pdf Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Markus Elfring <[email protected]> Cc: Dan Streetman <[email protected]> Cc: Seth Jennings <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2017-07-06  mm/zswap.c: improve a size determination in zswap_frontswap_init()  Markus Elfring  1-1/+1
Replace the specification of a data structure by a pointer dereference as the parameter for the operator "sizeof" to make the corresponding size determination a bit safer according to the Linux coding style convention. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Markus Elfring <[email protected]> Cc: Dan Streetman <[email protected]> Cc: Seth Jennings <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
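For readers unfamiliar with the style rule being applied, a minimal standalone illustration (generic userspace C, not the zswap code itself):

#include <stdlib.h>

struct example {                /* stands in for whatever structure is allocated */
        int refcount;
        void *handle;
};

static struct example *example_alloc(void)
{
        /* Preferred: sizeof(*e) stays correct if the type of 'e' changes. */
        struct example *e = calloc(1, sizeof(*e));

        /* Discouraged: sizeof(struct example) must be kept in sync by hand. */
        /* struct example *e = calloc(1, sizeof(struct example)); */

        return e;
}

int main(void)
{
        free(example_alloc());
        return 0;
}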
2017-07-06  mm/zswap.c: delete an error message for a failed memory allocation in ↵  Markus Elfring  1-3/+1
zswap_pool_create() Omit an extra message for a memory allocation failure in this function. This issue was detected by using the Coccinelle software. Link: http://events.linuxfoundation.org/sites/events/files/slides/LCJ16-Refactor_Strings-WSang_0.pdf Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Markus Elfring <[email protected]> Cc: Dan Streetman <[email protected]> Cc: Seth Jennings <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2017-07-06  mm/swapfile.c: sort swap entries before free  Huang Ying  1-0/+16
To reduce the lock contention of swap_info_struct->lock when freeing swap entry. The freed swap entries will be collected in a per-CPU buffer firstly, and be really freed later in batch. During the batch freeing, if the consecutive swap entries in the per-CPU buffer belongs to same swap device, the swap_info_struct->lock needs to be acquired/released only once, so that the lock contention could be reduced greatly. But if there are multiple swap devices, it is possible that the lock may be unnecessarily released/acquired because the swap entries belong to the same swap device are non-consecutive in the per-CPU buffer. To solve the issue, the per-CPU buffer is sorted according to the swap device before freeing the swap entries. With the patch, the memory (some swapped out) free time reduced 11.6% (from 2.65s to 2.35s) in the vm-scalability swap-w-rand test case with 16 processes. The test is done on a Xeon E5 v3 system. The swap device used is a RAM simulated PMEM (persistent memory) device. To test swapping, the test case creates 16 processes, which allocate and write to the anonymous pages until the RAM and part of the swap device is used up, finally the memory (some swapped out) is freed before exit. [[email protected]: tweak comment] Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Huang Ying <[email protected]> Acked-by: Tim Chen <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Shaohua Li <[email protected]> Cc: Minchan Kim <[email protected]> Cc: Rik van Riel <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
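The batching idea is easy to model outside the kernel; the following standalone program (a model of the described approach, not the kernel code) sorts a buffer of entries by swap device so each device's lock would be taken once per run of same-device entries:

#include <stdio.h>
#include <stdlib.h>

struct swap_entry { int type; unsigned long offset; };  /* type = device id */

static int cmp_by_type(const void *a, const void *b)
{
        const struct swap_entry *x = a, *y = b;

        return x->type - y->type;
}

static void free_batch(struct swap_entry *buf, size_t n)
{
        size_t i = 0;

        qsort(buf, n, sizeof(*buf), cmp_by_type);
        while (i < n) {
                int dev = buf[i].type;

                printf("lock device %d\n", dev);        /* si->lock taken once */
                while (i < n && buf[i].type == dev)
                        i++;                            /* free entry i here */
                printf("unlock device %d\n", dev);
        }
}

int main(void)
{
        struct swap_entry buf[] = { {1, 10}, {0, 3}, {1, 11}, {0, 4} };

        free_batch(buf, sizeof(buf) / sizeof(buf[0]));
        return 0;
}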
2017-07-06  mm/oom_kill: count global and memory cgroup oom kills  Konstantin Khlebnikov  3-0/+8
Show count of oom killer invocations in /proc/vmstat and count of processes killed in memory cgroup in knob "memory.events" (in memory.oom_control for v1 cgroup). Also describe difference between "oom" and "oom_kill" in memory cgroup documentation. Currently oom in memory cgroup kills tasks iff shortage has happened inside page fault. These counters helps in monitoring oom kills - for now the only way is grepping for magic words in kernel log. [[email protected]: fix for mem_cgroup_count_vm_event() rename] [[email protected]: fix comment, per Konstantin] Link: http://lkml.kernel.org/r/149570810989.203600.9492483715840752937.stgit@buzz Signed-off-by: Konstantin Khlebnikov <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Tetsuo Handa <[email protected]> Cc: Roman Guschin <[email protected]> Cc: David Rientjes <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2017-07-06  mm: per-cgroup memory reclaim stats  Roman Gushchin  6-12/+38
Track the following reclaim counters for every memory cgroup: PGREFILL, PGSCAN, PGSTEAL, PGACTIVATE, PGDEACTIVATE, PGLAZYFREE and PGLAZYFREED. These values are exposed using the memory.stats interface of cgroup v2. The meaning of each value is the same as for global counters, available using /proc/vmstat. Also, for consistency, rename mem_cgroup_count_vm_event() to count_memcg_event_mm(). Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Roman Gushchin <[email protected]> Suggested-by: Johannes Weiner <[email protected]> Acked-by: Michal Hocko <[email protected]> Acked-by: Vladimir Davydov <[email protected]> Acked-by: Johannes Weiner <[email protected]> Cc: Tejun Heo <[email protected]> Cc: Li Zefan <[email protected]> Cc: Balbir Singh <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2017-07-06  mm: kmemleak: treat vm_struct as alternative reference to vmalloc'ed objects  Catalin Marinas  2-10/+90
Kmemleak requires that vmalloc'ed objects have a minimum reference count of 2: one in the corresponding vm_struct object and the other owned by the vmalloc() caller. There are cases, however, where the original vmalloc() returned pointer is lost and, instead, a pointer to vm_struct is stored (see free_thread_stack()). Kmemleak currently reports such objects as leaks. This patch adds support for treating any surplus references to an object as additional references to a specified object. It introduces the kmemleak_vmalloc() API function which takes a vm_struct pointer and sets its surplus reference passing to the actual vmalloc() returned pointer. The __vmalloc_node_range() calling site has been modified accordingly. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Catalin Marinas <[email protected]> Reported-by: "Luis R. Rodriguez" <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Andy Lutomirski <[email protected]> Cc: "Luis R. Rodriguez" <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
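A hedged sketch of what the modified __vmalloc_node_range() call site looks like conceptually (the kmemleak_vmalloc() argument list is an assumption based on the changelog):

static void *example_vmalloc_finish(struct vm_struct *area, size_t size,
                                    gfp_t gfp_mask)
{
        /* Old scheme: the returned pointer needed a minimum ref count of 2. */
        /* kmemleak_alloc(area->addr, size, 2, gfp_mask); */

        /* New scheme: surplus references to 'area' count toward area->addr,
         * so keeping only the vm_struct (as free_thread_stack() does) is no
         * longer reported as a leak. */
        kmemleak_vmalloc(area, size, gfp_mask);

        return area->addr;
}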
2017-07-06  mm: kmemleak: factor object reference updating out of scan_block()  Catalin Marinas  1-18/+25
scan_block() updates the number of references (pointers) to objects, adding them to the gray_list when object->min_count is reached. The patch factors out this functionality into a separate update_refs() function. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Catalin Marinas <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Andy Lutomirski <[email protected]> Cc: "Luis R. Rodriguez" <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2017-07-06  mm: kmemleak: slightly reduce the size of some structures on 64-bit ↵  Catalin Marinas  1-3/+3
architectures Change the kmemleak_object.flags type to unsigned int and moves the early_log.min_count (int) near early_log.op_type (int) to slightly reduce the size of these structures on 64-bit architectures. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Catalin Marinas <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Andy Lutomirski <[email protected]> Cc: "Luis R. Rodriguez" <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2017-07-06  mm, mempolicy: don't check cpuset seqlock where it doesn't matter  Vlastimil Babka  1-16/+0
Two wrappers of __alloc_pages_nodemask() are checking task->mems_allowed_seq themselves to retry allocation that has raced with a cpuset update. This has been shown to be ineffective in preventing premature OOM's which can happen in __alloc_pages_slowpath() long before it returns back to the wrappers to detect the race at that level. Previous patches have made __alloc_pages_slowpath() more robust, so we can now simply remove the seqlock checking in the wrappers to prevent further wrong impression that it can actually help. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Vlastimil Babka <[email protected]> Acked-by: Michal Hocko <[email protected]> Cc: "Kirill A. Shutemov" <[email protected]> Cc: Andrea Arcangeli <[email protected]> Cc: Anshuman Khandual <[email protected]> Cc: Christoph Lameter <[email protected]> Cc: David Rientjes <[email protected]> Cc: Dimitri Sivanich <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Li Zefan <[email protected]> Cc: Mel Gorman <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2017-07-06  mm, mempolicy: simplify rebinding mempolicies when updating cpusets  Vlastimil Babka  1-84/+18
Commit c0ff7453bb5c ("cpuset,mm: fix no node to alloc memory when changing cpuset's mems") has introduced a two-step protocol when rebinding task's mempolicy due to cpuset update, in order to avoid a parallel allocation seeing an empty effective nodemask and failing. Later, commit cc9a6c877661 ("cpuset: mm: reduce large amounts of memory barrier related damage v3") introduced a seqlock protection and removed the synchronization point between the two update steps. At that point (or perhaps later), the two-step rebinding became unnecessary. Currently it only makes sure that the update first adds new nodes in step 1 and then removes nodes in step 2. Without memory barriers the effects are questionable, and even then this cannot prevent a parallel zonelist iteration checking the nodemask at each step to observe all nodes as unusable for allocation. We now fully rely on the seqlock to prevent premature OOMs and allocation failures. We can thus remove the two-step update parts and simplify the code. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Vlastimil Babka <[email protected]> Acked-by: Michal Hocko <[email protected]> Cc: "Kirill A. Shutemov" <[email protected]> Cc: Andrea Arcangeli <[email protected]> Cc: Anshuman Khandual <[email protected]> Cc: Christoph Lameter <[email protected]> Cc: David Rientjes <[email protected]> Cc: Dimitri Sivanich <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Li Zefan <[email protected]> Cc: Mel Gorman <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2017-07-06  mm, page_alloc: pass preferred nid instead of zonelist to allocator  Vlastimil Babka  4-37/+35
The main allocator function __alloc_pages_nodemask() takes a zonelist pointer as one of its parameters. All of its callers directly or indirectly obtain the zonelist via node_zonelist() using a preferred node id and gfp_mask. We can make the code a bit simpler by doing the zonelist lookup in __alloc_pages_nodemask(), passing it a preferred node id instead (gfp_mask is already another parameter). There are some code size benefits thanks to removal of inlined node_zonelist(): bloat-o-meter add/remove: 2/2 grow/shrink: 4/36 up/down: 399/-1351 (-952) This will also make things simpler if we proceed with converting cpusets to zonelists. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Vlastimil Babka <[email protected]> Reviewed-by: Christoph Lameter <[email protected]> Acked-by: Michal Hocko <[email protected]> Cc: Dimitri Sivanich <[email protected]> Cc: "Kirill A. Shutemov" <[email protected]> Cc: Andrea Arcangeli <[email protected]> Cc: Anshuman Khandual <[email protected]> Cc: David Rientjes <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Li Zefan <[email protected]> Cc: Mel Gorman <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
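The interface change in a nutshell, as a hedged before/after sketch (prototypes simplified, not copied from the tree):

/* Before: every caller did the zonelist lookup itself via node_zonelist():
 *
 *   struct page *__alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
 *                                       struct zonelist *zonelist,
 *                                       nodemask_t *nodemask);
 */

/* After: callers pass a preferred node id and the allocator performs the
 * node_zonelist(preferred_nid, gfp_mask) lookup internally. */
struct page *__alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
                                    int preferred_nid, nodemask_t *nodemask);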
2017-07-06  mm, mempolicy: stop adjusting current->il_next in mpol_rebind_nodemask()  Vlastimil Babka  1-15/+7
The task->il_next variable stores the next allocation node id for task's MPOL_INTERLEAVE policy. mpol_rebind_nodemask() updates interleave and bind mempolicies due to changing cpuset mems. Currently it also tries to make sure that current->il_next is valid within the updated nodemask. This is bogus, because 1) we are updating potentially any task's mempolicy, not just current, and 2) we might be updating a per-vma mempolicy, not task one. The interleave_nodes() function that uses il_next can cope fine with the value not being within the currently allowed nodes, so this hasn't manifested as an actual issue. We can remove the need for updating il_next completely by changing it to il_prev and store the node id of the previous interleave allocation instead of the next id. Then interleave_nodes() can calculate the next id using the current nodemask and also store it as il_prev, except when querying the next node via do_get_mempolicy(). Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Vlastimil Babka <[email protected]> Reviewed-by: Christoph Lameter <[email protected]> Cc: "Kirill A. Shutemov" <[email protected]> Cc: Andrea Arcangeli <[email protected]> Cc: Anshuman Khandual <[email protected]> Cc: David Rientjes <[email protected]> Cc: Dimitri Sivanich <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Li Zefan <[email protected]> Cc: Mel Gorman <[email protected]> Cc: Michal Hocko <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
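A hedged sketch of the resulting interleave step (reconstructed from the changelog, not copied from the tree; the il_prev field and the mempolicy nodemask layout are assumptions):

static unsigned int example_interleave_nodes(struct mempolicy *policy)
{
        unsigned int next;
        struct task_struct *me = current;

        /* Compute the next node from the previously used one against the
         * *current* nodemask, so nothing needs fixing up on cpuset updates. */
        next = next_node_in(me->il_prev, policy->v.nodes);
        if (next < MAX_NUMNODES)
                me->il_prev = next;     /* remember what was just used */
        return next;
}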
2017-07-06  mm, page_alloc: fix more premature OOM due to race with cpuset update  Vlastimil Babka  1-13/+38
I would like to stress that this patchset aims to fix issues and cleanup the code *within the existing documented semantics*, i.e. patch 1 ignores mempolicy restrictions if the set of allowed nodes has no intersection with set of nodes allowed by cpuset. I believe discussing potential changes of the semantics can be better done once we have a baseline with no known bugs of the current semantics. I've recently summarized the cpuset/mempolicy issues in a LSF/MM proposal [1] and the discussion itself [2]. I've been trying to rewrite the handling as proposed, with the idea that changing semantics to make all mempolicies static wrt cpuset updates (and discarding the relative and default modes) can be tried on top, as there's a high risk of being rejected/reverted because somebody might still care about the removed modes. However I haven't yet figured out how to properly: 1) make mempolicies swappable instead of rebinding in place. I thought mbind() already works that way and uses refcounting to avoid use-after-free of the old policy by a parallel allocation, but turns out true refcounting is only done for shared (shmem) mempolicies, and the actual protection for mbind() comes from mmap_sem. Extending the refcounting means more overhead in allocator hot path. Also swapping whole mempolicies means that we have to allocate the new ones, which can fail, and reverting of the partially done work also means allocating (note that mbind() doesn't care and will just leave part of the range updated and part not updated when returning -ENOMEM...). 2) make cpuset's task->mems_allowed also swappable (after converting it from nodemask to zonelist, which is the easy part) for mostly the same reasons. The good news is that while trying to do the above, I've at least figured out how to hopefully close the remaining premature OOM's, and do a buch of cleanups on top, removing quite some of the code that was also supposed to prevent the cpuset update races, but doesn't work anymore nowadays. This should fix the most pressing concerns with this topic and give us a better baseline before either proceeding with the original proposal, or pushing a change of semantics that removes the problem 1) above. I'd be then fine with trying to change the semantic first and rewrite later. Patchset has been tested with the LTP cpuset01 stress test. [1] https://lkml.kernel.org/r/[email protected] [2] https://lwn.net/Articles/717797/ [3] https://marc.info/?l=linux-mm&m=149191957922828&w=2 This patch (of 6): Commit e47483bca2cc ("mm, page_alloc: fix premature OOM when racing with cpuset mems update") has fixed known recent regressions found by LTP's cpuset01 testcase. I have however found that by modifying the testcase to use per-vma mempolicies via bind(2) instead of per-task mempolicies via set_mempolicy(2), the premature OOM still happens and the issue is much older. The root of the problem is that the cpuset's mems_allowed and mempolicy's nodemask can temporarily have no intersection, thus get_page_from_freelist() cannot find any usable zone. The current semantic for empty intersection is to ignore mempolicy's nodemask and honour cpuset restrictions. This is checked in node_zonelist(), but the racy update can happen after we already passed the check. 
Such races should be protected by the seqlock task->mems_allowed_seq, but it doesn't work here, because 1) mpol_rebind_mm() does not happen under seqlock for write, and doing so would lead to deadlock, as it takes mmap_sem for write, while the allocation can have mmap_sem for read when it's taking the seqlock for read. And 2) the seqlock cookie of callers of node_zonelist() (alloc_pages_vma() and alloc_pages_current()) is different than the one of __alloc_pages_slowpath(), so there's still a potential race window. This patch fixes the issue by having __alloc_pages_slowpath() check for empty intersection of cpuset and ac->nodemask before OOM or allocation failure. If it's indeed empty, the nodemask is ignored and allocation retried, which mimics node_zonelist(). This works fine, because almost all callers of __alloc_pages_nodemask are obtaining the nodemask via node_zonelist(). The only exception is new_node_page() from hotplug, where the potential violation of nodemask isn't an issue, as there's already a fallback allocation attempt without any nodemask. If there's a future caller that needs to have its specific nodemask honoured over task's cpuset restrictions, we'll have to e.g. add a gfp flag for that. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Vlastimil Babka <[email protected]> Acked-by: Michal Hocko <[email protected]> Cc: Li Zefan <[email protected]> Cc: Mel Gorman <[email protected]> Cc: David Rientjes <[email protected]> Cc: Christoph Lameter <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Andrea Arcangeli <[email protected]> Cc: Anshuman Khandual <[email protected]> Cc: "Kirill A. Shutemov" <[email protected]> Cc: Dimitri Sivanich <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
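A hedged sketch of the check added before declaring OOM or failing the allocation (helper and field names are assumptions; the logic mirrors the description above):

static bool example_retry_without_nodemask(struct alloc_context *ac)
{
        /* The cpuset may have raced into an empty intersection with the
         * caller's nodemask; if so, drop the nodemask and retry, which
         * mimics what node_zonelist() would have done. */
        if (cpusets_enabled() && ac->nodemask &&
            !cpuset_nodemask_valid_mems_allowed(ac->nodemask)) {
                ac->nodemask = NULL;
                return true;    /* caller jumps back and retries */
        }
        return false;
}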
2017-07-06  mm: rmap: use correct helper when poisoning hugepages  Punit Agrawal  1-2/+5
Using set_pte_at() does not do the right thing when putting down HWPOISON swap entries for hugepages on architectures that support contiguous ptes. Fix this problem by using set_huge_swap_pte_at() which was introduced to fix exactly this problem. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Punit Agrawal <[email protected]> Acked-by: Steve Capper <[email protected]> Cc: "Kirill A. Shutemov" <[email protected]> Cc: Catalin Marinas <[email protected]> Cc: Will Deacon <[email protected]> Cc: Naoya Horiguchi <[email protected]> Cc: Mike Kravetz <[email protected]> Cc: Mark Rutland <[email protected]> Cc: Hillf Danton <[email protected]> Cc: Aneesh Kumar K.V <[email protected]> Cc: Michal Hocko <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2017-07-06  mm/hugetlb: introduce set_huge_swap_pte_at() helper  Punit Agrawal  1-3/+5
set_huge_pte_at(), an architecture callback to populate hugepage ptes, does not provide the range of virtual memory that is targeted. This leads to ambiguity when dealing with swap entries on architectures that support hugepages consisting of contiguous ptes. Fix the problem by introducing an overridable helper that is called when populating the page tables with swap entries. The size of the targeted region is provided to the helper to help determine the number of entries to be updated. Provide a default implementation that maintains the current behaviour. [[email protected]: v4] Link: http://lkml.kernel.org/r/[email protected] [[email protected]: add an empty definition for set_huge_swap_pte_at()] Link: http://lkml.kernel.org/r/[email protected] Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Punit Agrawal <[email protected]> Acked-by: Steve Capper <[email protected]> Cc: Mike Kravetz <[email protected]> Cc: "Aneesh Kumar K.V" <[email protected]> Cc: Catalin Marinas <[email protected]> Cc: Will Deacon <[email protected]> Cc: Naoya Horiguchi <[email protected]> Cc: "Kirill A. Shutemov" <[email protected]> Cc: Mark Rutland <[email protected]> Cc: Hillf Danton <[email protected]> Cc: Michal Hocko <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
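A hedged sketch of the default helper described above (the exact prototype is an assumption; architectures with contiguous-pte hugepages override it, everyone else keeps the current behaviour):

#ifndef set_huge_swap_pte_at
static inline void set_huge_swap_pte_at(struct mm_struct *mm,
                                        unsigned long addr, pte_t *ptep,
                                        pte_t pte, unsigned long sz)
{
        set_huge_pte_at(mm, addr, ptep, pte);
}
#endif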
2017-07-06  mm/hugetlb: allow architectures to override huge_pte_clear()  Punit Agrawal  1-1/+1
When unmapping a hugepage range, huge_pte_clear() is used to clear the page table entries that are marked as not present. huge_pte_clear() internally just ends up calling pte_clear() which does not correctly deal with hugepages consisting of contiguous page table entries. Add a size argument to address this issue and allow architectures to override huge_pte_clear() by wrapping it in a #ifndef block. Update s390 implementation with the size parameter as well. Note that the change only affects huge_pte_clear() - the other generic hugetlb functions don't need any change. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Punit Agrawal <[email protected]> Acked-by: Martin Schwidefsky <[email protected]> [s390 bits] Cc: Heiko Carstens <[email protected]> Cc: Arnd Bergmann <[email protected]> Cc: "Aneesh Kumar K.V" <[email protected]> Cc: Mike Kravetz <[email protected]> Cc: Catalin Marinas <[email protected]> Cc: Will Deacon <[email protected]> Cc: Naoya Horiguchi <[email protected]> Cc: "Kirill A. Shutemov" <[email protected]> Cc: Steve Capper <[email protected]> Cc: Mark Rutland <[email protected]> Cc: Hillf Danton <[email protected]> Cc: Michal Hocko <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
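A hedged sketch of the generic fallback after this change, wrapped in #ifndef so an architecture can provide a size-aware version (exact placement and prototype are assumptions):

#ifndef huge_pte_clear
static inline void huge_pte_clear(struct mm_struct *mm, unsigned long addr,
                                  pte_t *ptep, unsigned long sz)
{
        pte_clear(mm, addr, ptep);
}
#endif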
2017-07-06  mm/hugetlb: add size parameter to huge_pte_offset()  Punit Agrawal  3-11/+18
A poisoned or migrated hugepage is stored as a swap entry in the page tables. On architectures that support hugepages consisting of contiguous page table entries (such as on arm64) this leads to ambiguity in determining the page table entry to return in huge_pte_offset() when a poisoned entry is encountered. Let's remove the ambiguity by adding a size parameter to convey additional information about the requested address. Also fixup the definition/usage of huge_pte_offset() throughout the tree. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Punit Agrawal <[email protected]> Acked-by: Steve Capper <[email protected]> Cc: Catalin Marinas <[email protected]> Cc: Will Deacon <[email protected]> Cc: Tony Luck <[email protected]> Cc: Fenghua Yu <[email protected]> Cc: James Hogan <[email protected]> (odd fixer:METAG ARCHITECTURE) Cc: Ralf Baechle <[email protected]> (supporter:MIPS) Cc: "James E.J. Bottomley" <[email protected]> Cc: Helge Deller <[email protected]> Cc: Benjamin Herrenschmidt <[email protected]> Cc: Paul Mackerras <[email protected]> Cc: Michael Ellerman <[email protected]> Cc: Martin Schwidefsky <[email protected]> Cc: Heiko Carstens <[email protected]> Cc: Yoshinori Sato <[email protected]> Cc: Rich Felker <[email protected]> Cc: "David S. Miller" <[email protected]> Cc: Chris Metcalf <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: "H. Peter Anvin" <[email protected]> Cc: Alexander Viro <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Mike Kravetz <[email protected]> Cc: Naoya Horiguchi <[email protected]> Cc: "Aneesh Kumar K.V" <[email protected]> Cc: "Kirill A. Shutemov" <[email protected]> Cc: Hillf Danton <[email protected]> Cc: Mark Rutland <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2017-07-06  mm, gup: ensure real head page is ref-counted when using hugepages  Punit Agrawal  1-6/+6
When speculatively taking references to a hugepage using page_cache_add_speculative() in gup_huge_pmd(), it is assumed that the page returned by pmd_page() is the head page. Although normally true, this assumption doesn't hold when the hugepage comprises of successive page table entries such as when using contiguous bit on arm64 at PTE or PMD levels. This can be addressed by ensuring that the page passed to page_cache_add_speculative() is the real head or by de-referencing the head page within the function. We take the first approach to keep the usage pattern aligned with page_cache_get_speculative() where users already pass the appropriate page, i.e., the de-referenced head. Apply the same logic to fix gup_huge_[pud|pgd]() as well. [[email protected]: fix arm64 ltp failure] Link: http://lkml.kernel.org/r/[email protected] Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Punit Agrawal <[email protected]> Acked-by: Steve Capper <[email protected]> Cc: Michal Hocko <[email protected]> Cc: "Kirill A. Shutemov" <[email protected]> Cc: Aneesh Kumar K.V <[email protected]> Cc: Catalin Marinas <[email protected]> Cc: Will Deacon <[email protected]> Cc: Naoya Horiguchi <[email protected]> Cc: Mark Rutland <[email protected]> Cc: Hillf Danton <[email protected]> Cc: Mike Kravetz <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2017-07-06  mm, gup: remove broken VM_BUG_ON_PAGE compound check for hugepages  Will Deacon  1-3/+0
When operating on hugepages with DEBUG_VM enabled, the GUP code checks the compound head for each tail page prior to calling page_cache_add_speculative. This is broken, because on the fast-GUP path (where we don't hold any page table locks) we can be racing with a concurrent invocation of split_huge_page_to_list. split_huge_page_to_list deals with this race by using page_ref_freeze to freeze the page and force concurrent GUPs to fail whilst the component pages are modified. This modification includes clearing the compound_head field for the tail pages, so checking this prior to a successful call to page_cache_add_speculative can lead to false positives: In fact, page_cache_add_speculative *already* has this check once the page refcount has been successfully updated, so we can simply remove the broken calls to VM_BUG_ON_PAGE. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Will Deacon <[email protected]> Signed-off-by: Punit Agrawal <[email protected]> Acked-by: Steve Capper <[email protected]> Acked-by: Kirill A. Shutemov <[email protected]> Cc: Aneesh Kumar K.V <[email protected]> Cc: Catalin Marinas <[email protected]> Cc: Naoya Horiguchi <[email protected]> Cc: Mark Rutland <[email protected]> Cc: Hillf Danton <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Mike Kravetz <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2017-07-06  mm: drop NULL return check of pte_offset_map_lock()  Naoya Horiguchi  2-4/+0
pte_offset_map_lock() finds and takes ptl, and returns pte. But some callers return without unlocking the ptl when pte == NULL, which seems weird. Git history said that !pte check in change_pte_range() was introduced in commit 1ad9f620c3a2 ("mm: numa: recheck for transhuge pages under lock during protection changes") and still remains after commit 175ad4f1e7a2 ("mm: mprotect: use pmd_trans_unstable instead of taking the pmd_lock") which partially reverts 1ad9f620c3a2. So I think that it's just dead code. Many other caller of pte_offset_map_lock() never check NULL return, so let's do likewise. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Naoya Horiguchi <[email protected]> Acked-by: Vlastimil Babka <[email protected]> Cc: Mel Gorman <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Andrea Arcangeli <[email protected]> Cc: Rik van Riel <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2017-07-06  mm/page_alloc.c: mark bad_range() and meminit_pfn_in_nid() as __maybe_unused  Matthias Kaehlcke  1-6/+8
The functions are not used in some configurations. Adding the attribute fixes the following warnings when building with clang: mm/page_alloc.c:409:19: error: function 'bad_range' is not needed and will not be emitted [-Werror,-Wunneeded-internal-declaration] mm/page_alloc.c:1106:30: error: unused function 'meminit_pfn_in_nid' [-Werror,-Wunused-function] Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Matthias Kaehlcke <[email protected]> Cc: Vlastimil Babka <[email protected]> Cc: Mel Gorman <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
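A minimal standalone illustration of the attribute being applied (generic C; in the kernel, __maybe_unused expands to __attribute__((__unused__))):

#include <stdio.h>

#define __maybe_unused __attribute__((__unused__))

/* Referenced only in some configurations; without the attribute, clang
 * warns about an unused/unneeded static function in the other builds. */
static int __maybe_unused bad_range_example(long pfn, long start, long end)
{
        return pfn < start || pfn >= end;
}

int main(void)
{
#ifdef DEBUG
        printf("%d\n", bad_range_example(5, 0, 10));
#endif
        return 0;
}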
2017-07-06  mm/hugetlb: clean up ARCH_HAS_GIGANTIC_PAGE  Aneesh Kumar K.V  1-5/+2
This moves the #ifdef in C code to a Kconfig dependency. Also we move the gigantic_page_supported() function to be arch specific. This allows architectures to conditionally enable runtime allocation of gigantic huge page. Architectures like ppc64 supports different gigantic huge page size (16G and 1G) based on the translation mode selected. This provides an opportunity for ppc64 to enable runtime allocation only w.r.t 1G hugepage. No functional change in this patch. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Aneesh Kumar K.V <[email protected]> Cc: Michael Ellerman <[email protected]> (powerpc) Cc: Anshuman Khandual <[email protected]> Cc: Benjamin Herrenschmidt <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2017-07-06  mm: adaptive hash table scaling  Pavel Tatashin  1-0/+25
Allow hash tables to scale with memory, but at a slower pace: when HASH_ADAPT is provided, every time memory quadruples the sizes of hash tables will only double instead of quadrupling as well. This algorithm starts working only when memory size reaches a certain point, currently set to 64G.

This is an example of the dentry hash table size, before and after, for various memory configurations:

 MEMORY    SCALE        HASH_SIZE
          old  new      old      new
    8G     13   13       8M       8M
   16G     13   13      16M      16M
   32G     13   13      32M      32M
   64G     13   13      64M      64M
  128G     13   14     128M      64M
  256G     13   14     256M     128M
  512G     13   15     512M     128M
 1024G     13   15    1024M     256M
 2048G     13   16    2048M     256M
 4096G     13   16    4096M     512M
 8192G     13   17    8192M     512M
16384G     13   17   16384M    1024M
32768G     13   18   32768M    1024M
65536G     13   18   65536M    2048M

The effect of this change on runtime is undetectable, as filesystem growth is not proportional to machine memory size as is currently assumed. The change affects only large-memory machines. Additional tuning might be needed, but that can be done by the clients of the kmem_cache_create interface, not the generic cache allocator itself. The adaptive hashing is disabled on 32-bit systems to avoid confusion about whether the base should be different for smaller systems, and to avoid overflows. [[email protected]: drop HASH_ADAPT] Link: http://lkml.kernel.org/r/[email protected] [[email protected]: UL -> ULL fix] Link: http://lkml.kernel.org/r/[email protected] [[email protected]: disable adaptive hash on 32 bit systems] Link: http://lkml.kernel.org/r/[email protected] Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Pavel Tatashin <[email protected]> Signed-off-by: Michal Hocko <[email protected]> Acked-by: Michal Hocko <[email protected]> Cc: David Miller <[email protected]> Cc: Al Viro <[email protected]> Cc: Babu Moger <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
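The scaling rule is easy to model in isolation; this standalone program (a model of the algorithm described above, not the kernel code) reproduces the "new" SCALE column of the table from a base scale of 13:

#include <stdio.h>

#define ADAPT_SCALE_BASE_GB 64ULL       /* threshold where adaptation starts */

static unsigned int adapted_scale(unsigned long long mem_gb,
                                  unsigned int base_scale)
{
        unsigned int scale = base_scale;
        unsigned long long t;

        for (t = ADAPT_SCALE_BASE_GB; t < mem_gb; t <<= 2)      /* memory x4 */
                scale++;                                        /* table x2 */
        return scale;
}

int main(void)
{
        unsigned long long gb;

        for (gb = 8; gb <= 65536; gb <<= 1)
                printf("%6lluG -> scale %u\n", gb, adapted_scale(gb, 13));
        return 0;
}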
2017-07-06  mm: zero hash tables in allocator  Pavel Tatashin  1-3/+9
Add a new flag, HASH_ZERO, which when provided guarantees that the hash table returned by alloc_large_system_hash() is zeroed. In most cases that is what the caller needs. Use the page-level allocator's __GFP_ZERO flag to zero the memory; it uses memset(), which is an efficient way to zero memory and is optimized for most platforms. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Pavel Tatashin <[email protected]> Reviewed-by: Babu Moger <[email protected]> Cc: David Miller <[email protected]> Cc: Al Viro <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
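A hedged sketch of how the flag is consumed inside alloc_large_system_hash() (simplified; variable names are assumptions):

static void *example_alloc_hash_pages(unsigned long size, int flags)
{
        gfp_t gfp_flags = GFP_ATOMIC | __GFP_NOWARN;

        if (flags & HASH_ZERO)
                gfp_flags |= __GFP_ZERO;        /* page allocator zeroes it */

        return (void *)__get_free_pages(gfp_flags, get_order(size));
}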
2017-07-06  mm/follow_page_mask: add support for hugepage directory entry  Aneesh Kumar K.V  2-0/+41
Architectures like ppc64 supports hugepage size that is not mapped to any of of the page table levels. Instead they add an alternate page table entry format called hugepage directory (hugepd). hugepd indicates that the page table entry maps to a set of hugetlb pages. Add support for this in generic follow_page_mask code. We already support this format in the generic gup code. The default implementation prints warning and returns NULL. We will add ppc64 support in later patches Link: http://lkml.kernel.org/r/1494926612-23928-7-git-send-email-aneesh.kumar@linux.vnet.ibm.com Signed-off-by: Aneesh Kumar K.V <[email protected]> Cc: Anshuman Khandual <[email protected]> Cc: Naoya Horiguchi <[email protected]> Cc: Michael Ellerman <[email protected]> Cc: Benjamin Herrenschmidt <[email protected]> Cc: Mike Kravetz <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2017-07-06  mm/follow_page_mask: add support for hugetlb pgd entries  Anshuman Khandual  2-0/+16
ppc64 supports pgd hugetlb entries. Add code to handle hugetlb pgd entries to follow_page_mask so that ppc64 can switch to it to handle hugetlbe entries. Link: http://lkml.kernel.org/r/1494926612-23928-5-git-send-email-aneesh.kumar@linux.vnet.ibm.com Signed-off-by: Anshuman Khandual <[email protected]> Signed-off-by: Aneesh Kumar K.V <[email protected]> Cc: Naoya Horiguchi <[email protected]> Cc: Michael Ellerman <[email protected]> Cc: Benjamin Herrenschmidt <[email protected]> Cc: Mike Kravetz <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2017-07-06  mm/hugetlb: export hugetlb_entry_migration helper  Aneesh Kumar K.V  1-4/+4
We will be using this later from the ppc64 code. Change the return type to bool. Link: http://lkml.kernel.org/r/1494926612-23928-4-git-send-email-aneesh.kumar@linux.vnet.ibm.com Signed-off-by: Aneesh Kumar K.V <[email protected]> Reviewed-by: Naoya Horiguchi <[email protected]> Cc: Anshuman Khandual <[email protected]> Cc: Michael Ellerman <[email protected]> Cc: Benjamin Herrenschmidt <[email protected]> Cc: Mike Kravetz <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2017-07-06  mm/follow_page_mask: split follow_page_mask to smaller functions.  Aneesh Kumar K.V  1-57/+91
Makes code reading easy. No functional changes in this patch. In a followup patch, we will be updating the follow_page_mask to handle hugetlb hugepd format so that archs like ppc64 can switch to the generic version. This split helps in doing that nicely. Link: http://lkml.kernel.org/r/1494926612-23928-3-git-send-email-aneesh.kumar@linux.vnet.ibm.com Signed-off-by: Aneesh Kumar K.V <[email protected]> Reviewed-by: Naoya Horiguchi <[email protected]> Cc: Anshuman Khandual <[email protected]> Cc: Michael Ellerman <[email protected]> Cc: Benjamin Herrenschmidt <[email protected]> Cc: Mike Kravetz <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2017-07-06mm/hugetlb/migration: use set_huge_pte_at instead of set_pte_atAneesh Kumar K.V1-10/+11
Patch series "HugeTLB migration support for PPC64", v2. This patch (of 9): The right interface to use to set a hugetlb pte entry is set_huge_pte_at. Use that instead of set_pte_at. Link: http://lkml.kernel.org/r/1494926612-23928-2-git-send-email-aneesh.kumar@linux.vnet.ibm.com Signed-off-by: Aneesh Kumar K.V <[email protected]> Reviewed-by: Naoya Horiguchi <[email protected]> Cc: Anshuman Khandual <[email protected]> Cc: Michael Ellerman <[email protected]> Cc: Benjamin Herrenschmidt <[email protected]> Cc: Mike Kravetz <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2017-07-06mm/madvise: enable (soft|hard) offline of HugeTLB pages at PGD levelAnshuman Khandual2-5/+26
Though migrating gigantic HugeTLB pages does not sound much like a real world use case, they can be affected by memory errors. Hence migration of PGD level HugeTLB pages should be supported, just to enable the soft and hard offline use cases. While allocating the new gigantic HugeTLB page, it should not matter whether the new page comes from the same node or not. There would be very few gigantic pages on the system after all; we should not be bothered about node locality when trying to save a big page from crashing. This change renames dequeue_huge_page_node() to dequeue_huge_page_node_exact(), preserving its original functionality. The new dequeue_huge_page_node() scans through all available online nodes to allocate a huge page for the NUMA_NO_NODE case and simply falls back to calling dequeue_huge_page_node_exact() for all other cases. [[email protected]: make hstate_is_gigantic() inline] Link: http://lkml.kernel.org/r/[email protected] Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Anshuman Khandual <[email protected]> Signed-off-by: Arnd Bergmann <[email protected]> Cc: "Aneesh Kumar K.V" <[email protected]> Cc: Naoya Horiguchi <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
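The fallback described above (scan the online nodes only for NUMA_NO_NODE, otherwise delegate to the exact-node variant) could be sketched as follows; this follows the changelog's description and is not a verbatim copy of the patch.

/*
 * Hedged sketch: NUMA_NO_NODE means "any node", so try every online node
 * in turn; an explicit nid keeps the old exact-node behaviour.
 */
static struct page *dequeue_huge_page_node(struct hstate *h, int nid)
{
        struct page *page;
        int node;

        if (nid != NUMA_NO_NODE)
                return dequeue_huge_page_node_exact(h, nid);

        for_each_online_node(node) {
                page = dequeue_huge_page_node_exact(h, node);
                if (page)
                        return page;
        }
        return NULL;
}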
2017-07-06mm, memory_hotplug: remove unused cruft after memory hotplug reworkMichal Hocko1-207/+0
zone_for_memory no longer has any users, and neither does the whole zone shifting infrastructure, so drop them all. This shouldn't introduce any functional changes. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Michal Hocko <[email protected]> Acked-by: Vlastimil Babka <[email protected]> Cc: Andi Kleen <[email protected]> Cc: Andrea Arcangeli <[email protected]> Cc: Balbir Singh <[email protected]> Cc: Dan Williams <[email protected]> Cc: Daniel Kiper <[email protected]> Cc: David Rientjes <[email protected]> Cc: Heiko Carstens <[email protected]> Cc: Igor Mammedov <[email protected]> Cc: Jerome Glisse <[email protected]> Cc: Joonsoo Kim <[email protected]> Cc: Martin Schwidefsky <[email protected]> Cc: Mel Gorman <[email protected]> Cc: Reza Arbab <[email protected]> Cc: Tobias Regnery <[email protected]> Cc: Toshi Kani <[email protected]> Cc: Vitaly Kuznetsov <[email protected]> Cc: Xishi Qiu <[email protected]> Cc: Yasuaki Ishimatsu <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2017-07-06mm, memory_hotplug: fix the section mismatch warningMichal Hocko1-1/+1
Tobias has reported the following section mismatches introduced by "mm, memory_hotplug: do not associate hotadded memory to zones until online":

WARNING: mm/built-in.o(.text+0x5a1c2): Section mismatch in reference from the function move_pfn_range_to_zone() to the function .meminit.text:memmap_init_zone()
The function move_pfn_range_to_zone() references the function __meminit memmap_init_zone(). This is often because move_pfn_range_to_zone lacks a __meminit annotation or the annotation of memmap_init_zone is wrong.

WARNING: mm/built-in.o(.text+0x5a25b): Section mismatch in reference from the function move_pfn_range_to_zone() to the function .meminit.text:init_currently_empty_zone()
The function move_pfn_range_to_zone() references the function __meminit init_currently_empty_zone(). This is often because move_pfn_range_to_zone lacks a __meminit annotation or the annotation of init_currently_empty_zone is wrong.

WARNING: vmlinux.o(.text+0x188aa2): Section mismatch in reference from the function move_pfn_range_to_zone() to the function .meminit.text:memmap_init_zone()
The function move_pfn_range_to_zone() references the function __meminit memmap_init_zone(). This is often because move_pfn_range_to_zone lacks a __meminit annotation or the annotation of memmap_init_zone is wrong.

WARNING: vmlinux.o(.text+0x188b3b): Section mismatch in reference from the function move_pfn_range_to_zone() to the function .meminit.text:init_currently_empty_zone()
The function move_pfn_range_to_zone() references the function __meminit init_currently_empty_zone(). This is often because move_pfn_range_to_zone lacks a __meminit annotation or the annotation of init_currently_empty_zone is wrong.

Both memmap_init_zone and init_currently_empty_zone are marked __meminit, but move_pfn_range_to_zone is used outside of __meminit sections (e.g. devm_memremap_pages), so we have to hide it from the checker with a __ref annotation.

Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Michal Hocko <[email protected]> Reported-by: Tobias Regnery <[email protected]> Cc: Andi Kleen <[email protected]> Cc: Andrea Arcangeli <[email protected]> Cc: Balbir Singh <[email protected]> Cc: Dan Williams <[email protected]> Cc: Daniel Kiper <[email protected]> Cc: David Rientjes <[email protected]> Cc: Heiko Carstens <[email protected]> Cc: Igor Mammedov <[email protected]> Cc: Jerome Glisse <[email protected]> Cc: Joonsoo Kim <[email protected]> Cc: Martin Schwidefsky <[email protected]> Cc: Mel Gorman <[email protected]> Cc: Reza Arbab <[email protected]> Cc: Toshi Kani <[email protected]> Cc: Vitaly Kuznetsov <[email protected]> Cc: Vlastimil Babka <[email protected]> Cc: Xishi Qiu <[email protected]> Cc: Yasuaki Ishimatsu <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
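The fix boils down to one annotation on the function definition; a hedged sketch (the parameter list is assumed for the illustration):

/*
 * __ref tells modpost that this reference to __meminit code is intentional:
 * move_pfn_range_to_zone() is also reachable from non-init paths such as
 * devm_memremap_pages(), so it cannot itself be __meminit.
 */
void __ref move_pfn_range_to_zone(struct zone *zone, unsigned long start_pfn,
                                  unsigned long nr_pages)
{
        /* unchanged body calling memmap_init_zone() and
         * init_currently_empty_zone() */
}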
2017-07-06mm, memory_hotplug: replace for_device by want_memblock in arch_add_memoryMichal Hocko1-1/+1
arch_add_memory gets a for_device argument which then controls whether we want to create memblocks for the created memory sections. Simplify the logic by telling directly whether we want memblocks rather than going through a pointless negation. This also makes the API easier to understand because it is clear what we want, rather than an uninformative for_device which can mean anything. This shouldn't introduce any functional change. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Michal Hocko <[email protected]> Tested-by: Dan Williams <[email protected]> Acked-by: Vlastimil Babka <[email protected]> Cc: Andi Kleen <[email protected]> Cc: Andrea Arcangeli <[email protected]> Cc: Balbir Singh <[email protected]> Cc: Daniel Kiper <[email protected]> Cc: David Rientjes <[email protected]> Cc: Heiko Carstens <[email protected]> Cc: Igor Mammedov <[email protected]> Cc: Jerome Glisse <[email protected]> Cc: Joonsoo Kim <[email protected]> Cc: Martin Schwidefsky <[email protected]> Cc: Mel Gorman <[email protected]> Cc: Reza Arbab <[email protected]> Cc: Tobias Regnery <[email protected]> Cc: Toshi Kani <[email protected]> Cc: Vitaly Kuznetsov <[email protected]> Cc: Xishi Qiu <[email protected]> Cc: Yasuaki Ishimatsu <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
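In effect the prototype changes from a negated device flag to a positive request; a hedged before/after sketch of the declaration (the exact argument types are assumed to match the usual form of this interface):

/* Before: callers had to reason about what "for_device" implies. */
int arch_add_memory(int nid, u64 start, u64 size, bool for_device);

/* After: the caller states directly whether memblock devices are wanted. */
int arch_add_memory(int nid, u64 start, u64 size, bool want_memblock);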
2017-07-06mm, memory_hotplug: do not assume ZONE_NORMAL is default kernel zoneMichal Hocko1-3/+24
Heiko Carstens has noticed that he can generate overlapping zones for ZONE_DMA and ZONE_NORMAL:

DMA    [mem 0x0000000000000000-0x000000007fffffff]
Normal [mem 0x0000000080000000-0x000000017fffffff]

$ cat /sys/devices/system/memory/block_size_bytes
10000000
$ cat /sys/devices/system/memory/memory5/valid_zones
DMA
$ echo 0 > /sys/devices/system/memory/memory5/online
$ cat /sys/devices/system/memory/memory5/valid_zones
Normal
$ echo 1 > /sys/devices/system/memory/memory5/online
Normal

$ cat /proc/zoneinfo
Node 0, zone      DMA
  spanned  524288        <-----
  present  458752
  managed  455078
  start_pfn: 0            <-----
Node 0, zone   Normal
  spanned  720896
  present  589824
  managed  571648
  start_pfn: 327680       <-----

The reason is that we assume that the default zone for kernel onlining is ZONE_NORMAL. This was a simplification introduced by the memory hotplug rework and it is easily fixable by checking the range overlap in the zone order and considering the first matching zone as the default one. If there is no such zone then assume ZONE_NORMAL as we have been doing so far.

Fixes: "mm, memory_hotplug: do not associate hotadded memory to zones until online" Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Michal Hocko <[email protected]> Reported-by: Heiko Carstens <[email protected]> Tested-by: Heiko Carstens <[email protected]> Acked-by: Vlastimil Babka <[email protected]> Cc: Dan Williams <[email protected]> Cc: Reza Arbab <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
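The fix described above (walk the kernel zones in order, pick the first one the range overlaps, else fall back to ZONE_NORMAL) might be sketched as below; the helper name and the iteration bounds are assumptions made for the illustration.

/*
 * Hedged sketch: choose the default zone for kernel onlining by checking
 * which existing kernel zone the pfn range already intersects, instead of
 * blindly assuming ZONE_NORMAL.
 */
static struct zone *default_kernel_zone_for_pfn(int nid, unsigned long start_pfn,
                                                unsigned long nr_pages)
{
        struct pglist_data *pgdat = NODE_DATA(nid);
        int zid;

        for (zid = 0; zid <= ZONE_NORMAL; zid++) {
                struct zone *zone = &pgdat->node_zones[zid];

                if (zone_intersects(zone, start_pfn, nr_pages))
                        return zone;
        }

        return &pgdat->node_zones[ZONE_NORMAL];
}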
2017-07-06mm, memory_hotplug: fix MMOP_ONLINE_KEEP behaviorMichal Hocko1-4/+5
Heiko Carstens has noticed that MMOP_ONLINE_KEEP is currently broken:

$ grep . memory3?/valid_zones
memory34/valid_zones:Normal Movable
memory35/valid_zones:Normal Movable
memory36/valid_zones:Normal Movable
memory37/valid_zones:Normal Movable
$ echo online_movable > memory34/state
$ grep . memory3?/valid_zones
memory34/valid_zones:Movable
memory35/valid_zones:Movable
memory36/valid_zones:Movable
memory37/valid_zones:Movable
$ echo online > memory36/state
$ grep . memory3?/valid_zones
memory34/valid_zones:Movable
memory36/valid_zones:Normal
memory37/valid_zones:Movable

so we have effectively punched a hole into the movable zone.

The problem is that the move_pfn_range() check for MMOP_ONLINE_KEEP is wrong. It only checks whether the given range is already part of the movable zone, which is not the case here as only memory34 is in the zone. Fix this by using allow_online_pfn_range(..., MMOP_ONLINE_KERNEL); if that is false then we can be sure that movable onlining is the right thing to do.

Fixes: "mm, memory_hotplug: do not associate hotadded memory to zones until online" Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Michal Hocko <[email protected]> Reported-by: Heiko Carstens <[email protected]> Tested-by: Heiko Carstens <[email protected]> Acked-by: Vlastimil Babka <[email protected]> Cc: Dan Williams <[email protected]> Cc: Reza Arbab <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
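The corrected decision can be sketched as follows; the wrapper name and the exact argument order of allow_online_pfn_range() are assumptions, only the logic (default to a kernel zone, fall back to movable when kernel onlining is not allowed) comes from the changelog.

/*
 * Hedged sketch: MMOP_ONLINE_KEEP defaults to the kernel zone and only
 * falls back to ZONE_MOVABLE when kernel onlining of this range is not
 * allowed (e.g. it would punch a hole into the movable zone).
 */
static struct zone *zone_for_online_keep(int nid, unsigned long start_pfn,
                                         unsigned long nr_pages)
{
        struct pglist_data *pgdat = NODE_DATA(nid);

        if (allow_online_pfn_range(nid, start_pfn, nr_pages, MMOP_ONLINE_KERNEL))
                return &pgdat->node_zones[ZONE_NORMAL];

        return &pgdat->node_zones[ZONE_MOVABLE];
}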
2017-07-06mm, memory_hotplug: do not associate hotadded memory to zones until onlineMichal Hocko2-81/+123
The current memory hotplug implementation relies on having all the struct pages associated with a zone/node during the physical hotplug phase (arch_add_memory->__add_pages->__add_section->__add_zone). In the vast majority of cases this means that they are added to ZONE_NORMAL. This has been so since 9d99aaa31f59 ("[PATCH] x86_64: Support memory hotadd without sparsemem") and it wasn't a big deal back then because movable onlining didn't exist yet.

Much later memory hotplug wanted to (ab)use ZONE_MOVABLE for movable onlining 511c2aba8f07 ("mm, memory-hotplug: dynamic configure movable memory and portion memory") and then things got more complicated. Rather than reconsidering the zone association, which was no longer needed (because the memory hotplug already depended on SPARSEMEM), a convoluted semantic of zone shifting has been developed. Only the currently last memblock or the one adjacent to the zone_movable can be onlined movable. This essentially means that the online type changes as new memblocks are added.

Let's simulate memory hot online manually:

$ echo 0x100000000 > /sys/devices/system/memory/probe
$ grep . /sys/devices/system/memory/memory32/valid_zones
Normal Movable

$ echo $((0x100000000+(128<<20))) > /sys/devices/system/memory/probe
$ grep . /sys/devices/system/memory/memory3?/valid_zones
/sys/devices/system/memory/memory32/valid_zones:Normal
/sys/devices/system/memory/memory33/valid_zones:Normal Movable

$ echo $((0x100000000+2*(128<<20))) > /sys/devices/system/memory/probe
$ grep . /sys/devices/system/memory/memory3?/valid_zones
/sys/devices/system/memory/memory32/valid_zones:Normal
/sys/devices/system/memory/memory33/valid_zones:Normal
/sys/devices/system/memory/memory34/valid_zones:Normal Movable

$ echo online_movable > /sys/devices/system/memory/memory34/state
$ grep . /sys/devices/system/memory/memory3?/valid_zones
/sys/devices/system/memory/memory32/valid_zones:Normal
/sys/devices/system/memory/memory33/valid_zones:Normal Movable
/sys/devices/system/memory/memory34/valid_zones:Movable Normal

This is an awkward semantic because a udev event is sent as soon as the block is onlined and a udev handler might want to online it based on some policy (e.g. association with a node) but it will inherently race with new blocks showing up.

This patch changes the physical online phase to not associate pages with any zone at all. All the pages are just marked reserved and wait for the onlining phase to be associated with the zone as per the online request. There are only two requirements:
- existing ZONE_NORMAL and ZONE_MOVABLE cannot overlap
- ZONE_NORMAL precedes ZONE_MOVABLE in physical addresses
The latter is not an inherent requirement and can be changed in the future. It preserves the current behavior and makes the code slightly simpler. This is subject to change in the future.
This means that the same physical online steps as above will lead to the following state:

Normal Movable

/sys/devices/system/memory/memory32/valid_zones:Normal Movable
/sys/devices/system/memory/memory33/valid_zones:Normal Movable

/sys/devices/system/memory/memory32/valid_zones:Normal Movable
/sys/devices/system/memory/memory33/valid_zones:Normal Movable
/sys/devices/system/memory/memory34/valid_zones:Normal Movable

/sys/devices/system/memory/memory32/valid_zones:Normal Movable
/sys/devices/system/memory/memory33/valid_zones:Normal Movable
/sys/devices/system/memory/memory34/valid_zones:Movable

Implementation: The current move_pfn_range is reimplemented to check the above requirements (allow_online_pfn_range) and then update the respective zone (move_pfn_range_to_zone) and the pgdat, and link all the pages in the pfn range with the zone/node. __add_pages is updated to not require the zone and only initializes sections in the range. This allowed simplifying the arch_add_memory code (s390 could get rid of quite some code).

devm_memremap_pages is the only user of arch_add_memory which relies on the zone association because it hooks into the memory hotplug only half way. It uses it to associate the new memory with ZONE_DEVICE but doesn't allow it to be {on,off}lined via sysfs. This means that this particular code path has to call move_pfn_range_to_zone explicitly.

The original zone shifting code is kept in place and will be removed in the follow up patch for an easier review.

Please note that this patch also changes the original behavior: offlining a memory block adjacent to another zone (Normal vs. Movable) used to allow changing its movable type. This will be handled later.

[[email protected]: simplify zone_intersects()] Link: http://lkml.kernel.org/r/[email protected] [[email protected]: remove duplicate call for set_page_links] Link: http://lkml.kernel.org/r/[email protected] [[email protected]: remove unused local `i'] Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Michal Hocko <[email protected]> Signed-off-by: Wei Yang <[email protected]> Tested-by: Dan Williams <[email protected]> Tested-by: Reza Arbab <[email protected]> Acked-by: Heiko Carstens <[email protected]> # For s390 bits Acked-by: Vlastimil Babka <[email protected]> Cc: Martin Schwidefsky <[email protected]> Cc: Andi Kleen <[email protected]> Cc: Andrea Arcangeli <[email protected]> Cc: Balbir Singh <[email protected]> Cc: Daniel Kiper <[email protected]> Cc: David Rientjes <[email protected]> Cc: Igor Mammedov <[email protected]> Cc: Jerome Glisse <[email protected]> Cc: Joonsoo Kim <[email protected]> Cc: Mel Gorman <[email protected]> Cc: Tobias Regnery <[email protected]> Cc: Toshi Kani <[email protected]> Cc: Vitaly Kuznetsov <[email protected]> Cc: Xishi Qiu <[email protected]> Cc: Yasuaki Ishimatsu <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
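Because devm_memremap_pages bypasses the sysfs onlining step, the ZONE_DEVICE association has to be made explicitly, as noted above; a hedged sketch of what such a call might look like (the wrapper and the nid/res parameters are invented for the example, and the move_pfn_range_to_zone parameter list is assumed):

/*
 * Hedged sketch: a ZONE_DEVICE user that hooks into memory hotplug only
 * half way (no sysfs onlining) attaches the new pfn range to its zone by
 * hand, since no online request will ever do it.
 */
static void example_attach_zone_device(int nid, struct resource *res)
{
        struct zone *zone = &NODE_DATA(nid)->node_zones[ZONE_DEVICE];

        move_pfn_range_to_zone(zone, PHYS_PFN(res->start),
                               PHYS_PFN(resource_size(res)));
}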
2017-07-06mm, vmstat: skip reporting offline pages in pagetypeinfoMichal Hocko1-3/+2
pagetypeinfo_showblockcount_print skips over invalid pfns but it would report pages which are offline because those have a valid pfn. Their migrate type is misleading at best. Now that we have pfn_to_online_page() we can use it instead of pfn_valid() and fix this. [[email protected]: fix build] Link: http://lkml.kernel.org/r/[email protected] Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Michal Hocko <[email protected]> Reported-by: Joonsoo Kim <[email protected]> Acked-by: Vlastimil Babka <[email protected]> Cc: Andi Kleen <[email protected]> Cc: Andrea Arcangeli <[email protected]> Cc: Balbir Singh <[email protected]> Cc: Dan Williams <[email protected]> Cc: Daniel Kiper <[email protected]> Cc: David Rientjes <[email protected]> Cc: Heiko Carstens <[email protected]> Cc: Igor Mammedov <[email protected]> Cc: Jerome Glisse <[email protected]> Cc: Martin Schwidefsky <[email protected]> Cc: Mel Gorman <[email protected]> Cc: Reza Arbab <[email protected]> Cc: Tobias Regnery <[email protected]> Cc: Toshi Kani <[email protected]> Cc: Vitaly Kuznetsov <[email protected]> Cc: Vlastimil Babka <[email protected]> Cc: Xishi Qiu <[email protected]> Cc: Yasuaki Ishimatsu <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
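The substitution described here is mechanical; in the hedged sketch below the helper is invented to show the shape of the check inside the pagetypeinfo loop.

/*
 * Hedged sketch: return the struct page only when the pfn is backed by an
 * online section. Offline pfns may still be pfn_valid(), but
 * pfn_to_online_page() yields NULL for them, so the caller skips them
 * instead of reporting a misleading migrate type.
 */
static struct page *example_block_page(unsigned long pfn)
{
        /* was: if (!pfn_valid(pfn)) return NULL; return pfn_to_page(pfn); */
        return pfn_to_online_page(pfn);
}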