blaster4385/linux-IllusionX - Linux kernel with personal config changes for arch linux

Age	Commit message (Collapse)	Author	Files	Lines
2019-07-12	mm, oom: remove redundant task_in_mem_cgroup() check	Shakeel Butt	2	-38/+7
	oom_unkillable_task() can be called from three different contexts i.e. global OOM, memcg OOM and oom_score procfs interface. At the moment oom_unkillable_task() does a task_in_mem_cgroup() check on the given process. Since there is no reason to perform task_in_mem_cgroup() check for global OOM and oom_score procfs interface, those contexts provide NULL memcg and skips the task_in_mem_cgroup() check. However for memcg OOM context, the oom_unkillable_task() is always called from mem_cgroup_scan_tasks() and thus task_in_mem_cgroup() check becomes redundant and effectively dead code. So, just remove the task_in_mem_cgroup() check altogether. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Shakeel Butt <[email protected]> Signed-off-by: Tetsuo Handa <[email protected]> Acked-by: Roman Gushchin <[email protected]> Acked-by: Michal Hocko <[email protected]> Cc: David Rientjes <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: KOSAKI Motohiro <[email protected]> Cc: Nick Piggin <[email protected]> Cc: Paul Jackson <[email protected]> Cc: Vladimir Davydov <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2019-07-12	mm, oom: refactor dump_tasks for memcg OOMs	Shakeel Butt	1	-28/+40
	dump_tasks() traverses all the existing processes even for the memcg OOM context which is not only unnecessary but also wasteful. This imposes a long RCU critical section even from a contained context which can be quite disruptive. Change dump_tasks() to be aligned with select_bad_process and use mem_cgroup_scan_tasks to selectively traverse only processes of the target memcg hierarchy during memcg OOM. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Shakeel Butt <[email protected]> Acked-by: Michal Hocko <[email protected]> Acked-by: Roman Gushchin <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Tetsuo Handa <[email protected]> Cc: Vladimir Davydov <[email protected]> Cc: David Rientjes <[email protected]> Cc: KOSAKI Motohiro <[email protected]> Cc: Paul Jackson <[email protected]> Cc: Nick Piggin <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2019-07-12	mm: memcontrol: use CSS_TASK_ITER_PROCS at mem_cgroup_scan_tasks()	Tetsuo Handa	2	-4/+1
	Since commit c03cd7738a83 ("cgroup: Include dying leaders with live threads in PROCS iterations") corrected how CSS_TASK_ITER_PROCS works, mem_cgroup_scan_tasks() can use CSS_TASK_ITER_PROCS in order to check only one thread from each thread group. [[email protected]: remove thread group leader check in oom_evaluate_task()] Link: http://lkml.kernel.org/r/1560853257-14934-1-git-send-email-penguin-kernel@I-love.SAKURA.ne.jp Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Tetsuo Handa <[email protected]> Acked-by: Michal Hocko <[email protected]> Reviewed-by: Shakeel Butt <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Tejun Heo <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2019-07-12	mm/memory-failure.c: clarify error message	Jane Chu	1	-1/+1
	Some user who install SIGBUS handler that does longjmp out therefore keeping the process alive is confused by the error message "[188988.765862] Memory failure: 0x1840200: Killing cellsrv:33395 due to hardware memory corruption" Slightly modify the error message to improve clarity. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Jane Chu <[email protected]> Acked-by: Naoya Horiguchi <[email protected]> Acked-by: Pankaj Gupta <[email protected]> Reviewed-by: Anshuman Khandual <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2019-07-12	mm: vmalloc: show number of vmalloc pages in /proc/meminfo	Roman Gushchin	1	-0/+10
	Vmalloc() is getting more and more used these days (kernel stacks, bpf and percpu allocator are new top users), and the total % of memory consumed by vmalloc() can be pretty significant and changes dynamically. /proc/meminfo is the best place to display this information: its top goal is to show top consumers of the memory. Since the VmallocUsed field in /proc/meminfo is not in use for quite a long time (it has been defined to 0 by a5ad88ce8c7f ("mm: get rid of 'vmalloc_info' from /proc/meminfo")), let's reuse it for showing the actual physical memory consumption of vmalloc(). Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Roman Gushchin <[email protected]> Acked-by: Johannes Weiner <[email protected]> Acked-by: Vlastimil Babka <[email protected]> Reviewed-by: Andrew Morton <[email protected]> Cc: Matthew Wilcox <[email protected]> Cc: Minchan Kim <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2019-07-12	mm: use down_read_killable for locking mmap_sem in access_remote_vm	Konstantin Khlebnikov	2	-2/+5
	This function is used by ptrace and proc files like /proc/pid/cmdline and /proc/pid/environ. Access_remote_vm never returns error codes, all errors are ignored and only size of successfully read data is returned. So, if current task was killed we'll simply return 0 (bytes read). Mmap_sem could be locked for a long time or forever if something goes wrong. Using a killable lock permits cleanup of stuck tasks and simplifies investigation. Link: http://lkml.kernel.org/r/156007494202.3335.16782303099589302087.stgit@buzz Signed-off-by: Konstantin Khlebnikov <[email protected]> Reviewed-by: Michal Koutný <[email protected]> Acked-by: Oleg Nesterov <[email protected]> Acked-by: Michal Hocko <[email protected]> Cc: Alexey Dobriyan <[email protected]> Cc: Matthew Wilcox <[email protected]> Cc: Cyrill Gorcunov <[email protected]> Cc: Kirill Tkhai <[email protected]> Cc: Al Viro <[email protected]> Cc: Roman Gushchin <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2019-07-12	mm: vmscan: correct some vmscan counters for THP swapout	Yang Shi	1	-14/+49
	Commit bd4c82c22c36 ("mm, THP, swap: delay splitting THP after swapped out"), THP can be swapped out in a whole. But, nr_reclaimed and some other vm counters still get inc'ed by one even though a whole THP (512 pages) gets swapped out. This doesn't make too much sense to memory reclaim. For example, direct reclaim may just need reclaim SWAP_CLUSTER_MAX pages, reclaiming one THP could fulfill it. But, if nr_reclaimed is not increased correctly, direct reclaim may just waste time to reclaim more pages, SWAP_CLUSTER_MAX * 512 pages in worst case. And, it may cause pgsteal_{kswapd\|direct} is greater than pgscan_{kswapd\|direct}, like the below: pgsteal_kswapd 122933 pgsteal_direct 26600225 pgscan_kswapd 174153 pgscan_direct 14678312 nr_reclaimed and nr_scanned must be fixed in parallel otherwise it would break some page reclaim logic, e.g. vmpressure: this looks at the scanned/reclaimed ratio so it won't change semantics as long as scanned & reclaimed are fixed in parallel. compaction/reclaim: compaction wants a certain number of physical pages freed up before going back to compacting. kswapd priority raising: kswapd raises priority if we scan fewer pages than the reclaim target (which itself is obviously expressed in order-0 pages). As a result, kswapd can falsely raise its aggressiveness even when it's making great progress. Other than nr_scanned and nr_reclaimed, some other counters, e.g. pgactivate, nr_skipped, nr_ref_keep and nr_unmap_fail need to be fixed too since they are user visible via cgroup, /proc/vmstat or trace points, otherwise they would be underreported. When isolating pages from LRUs, nr_taken has been accounted in base page, but nr_scanned and nr_skipped are still accounted in THP. It doesn't make too much sense too since this may cause trace point underreport the numbers as well. So accounting those counters in base page instead of accounting THP as one page. nr_dirty, nr_unqueued_dirty, nr_congested and nr_writeback are used by file cache, so they are not impacted by THP swap. This change may result in lower steal/scan ratio in some cases since THP may get split during page reclaim, then a part of tail pages get reclaimed instead of the whole 512 pages, but nr_scanned is accounted by 512, particularly for direct reclaim. But, this should be not a significant issue. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Yang Shi <[email protected]> Reviewed-by: "Huang, Ying" <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Mel Gorman <[email protected]> Cc: "Kirill A . Shutemov" <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Shakeel Butt <[email protected]> Cc: Hillf Danton <[email protected]> Cc: Josef Bacik <[email protected]> Cc: Michal Hocko <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2019-07-12	mm: vmscan: remove double slab pressure by inc'ing sc->nr_scanned	Yang Shi	1	-5/+0
	Commit 9092c71bb724 ("mm: use sc->priority for slab shrink targets") has broken up the relationship between sc->nr_scanned and slab pressure. The sc->nr_scanned can't double slab pressure anymore. So, it sounds no sense to still keep sc->nr_scanned inc'ed. Actually, it would prevent from adding pressure on slab shrink since excessive sc->nr_scanned would prevent from scan->priority raise. The bonnie test doesn't show this would change the behavior of slab shrinkers. w/ w/o /sec %CP /sec %CP Sequential delete: 3960.6 94.6 3997.6 96.2 Random delete: 2518 63.8 2561.6 64.6 The slight increase of "/sec" without the patch would be caused by the slight increase of CPU usage. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Yang Shi <[email protected]> Acked-by: Johannes Weiner <[email protected]> Cc: Josef Bacik <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Mel Gorman <[email protected]> Cc: "Kirill A . Shutemov" <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Shakeel Butt <[email protected]> Cc: Hillf Danton <[email protected]> Cc: "Huang, Ying" <[email protected]> Cc: Michal Hocko <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2019-07-12	mm: security: introduce init_on_alloc=1 and init_on_free=1 boot options	Alexander Potapenko	5	-16/+135
	Patch series "add init_on_alloc/init_on_free boot options", v10. Provide init_on_alloc and init_on_free boot options. These are aimed at preventing possible information leaks and making the control-flow bugs that depend on uninitialized values more deterministic. Enabling either of the options guarantees that the memory returned by the page allocator and SL[AU]B is initialized with zeroes. SLOB allocator isn't supported at the moment, as its emulation of kmem caches complicates handling of SLAB_TYPESAFE_BY_RCU caches correctly. Enabling init_on_free also guarantees that pages and heap objects are initialized right after they're freed, so it won't be possible to access stale data by using a dangling pointer. As suggested by Michal Hocko, right now we don't let the heap users to disable initialization for certain allocations. There's not enough evidence that doing so can speed up real-life cases, and introducing ways to opt-out may result in things going out of control. This patch (of 2): The new options are needed to prevent possible information leaks and make control-flow bugs that depend on uninitialized values more deterministic. This is expected to be on-by-default on Android and Chrome OS. And it gives the opportunity for anyone else to use it under distros too via the boot args. (The init_on_free feature is regularly requested by folks where memory forensics is included in their threat models.) init_on_alloc=1 makes the kernel initialize newly allocated pages and heap objects with zeroes. Initialization is done at allocation time at the places where checks for __GFP_ZERO are performed. init_on_free=1 makes the kernel initialize freed pages and heap objects with zeroes upon their deletion. This helps to ensure sensitive data doesn't leak via use-after-free accesses. Both init_on_alloc=1 and init_on_free=1 guarantee that the allocator returns zeroed memory. The two exceptions are slab caches with constructors and SLAB_TYPESAFE_BY_RCU flag. Those are never zero-initialized to preserve their semantics. Both init_on_alloc and init_on_free default to zero, but those defaults can be overridden with CONFIG_INIT_ON_ALLOC_DEFAULT_ON and CONFIG_INIT_ON_FREE_DEFAULT_ON. If either SLUB poisoning or page poisoning is enabled, those options take precedence over init_on_alloc and init_on_free: initialization is only applied to unpoisoned allocations. Slowdown for the new features compared to init_on_free=0, init_on_alloc=0: hackbench, init_on_free=1: +7.62% sys time (st.err 0.74%) hackbench, init_on_alloc=1: +7.75% sys time (st.err 2.14%) Linux build with -j12, init_on_free=1: +8.38% wall time (st.err 0.39%) Linux build with -j12, init_on_free=1: +24.42% sys time (st.err 0.52%) Linux build with -j12, init_on_alloc=1: -0.13% wall time (st.err 0.42%) Linux build with -j12, init_on_alloc=1: +0.57% sys time (st.err 0.40%) The slowdown for init_on_free=0, init_on_alloc=0 compared to the baseline is within the standard error. The new features are also going to pave the way for hardware memory tagging (e.g. arm64's MTE), which will require both on_alloc and on_free hooks to set the tags for heap objects. With MTE, tagging will have the same cost as memory initialization. Although init_on_free is rather costly, there are paranoid use-cases where in-memory data lifetime is desired to be minimized. There are various arguments for/against the realism of the associated threat models, but given that we'll need the infrastructure for MTE anyway, and there are people who want wipe-on-free behavior no matter what the performance cost, it seems reasonable to include it in this series. [[email protected]: v8] Link: http://lkml.kernel.org/r/[email protected] [[email protected]: v9] Link: http://lkml.kernel.org/r/[email protected] [[email protected]: v10] Link: http://lkml.kernel.org/r/[email protected] Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Alexander Potapenko <[email protected]> Acked-by: Kees Cook <[email protected]> Acked-by: Michal Hocko <[email protected]> [page and dmapool parts Acked-by: James Morris <[email protected]>] Cc: Christoph Lameter <[email protected]> Cc: Masahiro Yamada <[email protected]> Cc: "Serge E. Hallyn" <[email protected]> Cc: Nick Desaulniers <[email protected]> Cc: Kostya Serebryany <[email protected]> Cc: Dmitry Vyukov <[email protected]> Cc: Sandeep Patil <[email protected]> Cc: Laura Abbott <[email protected]> Cc: Randy Dunlap <[email protected]> Cc: Jann Horn <[email protected]> Cc: Mark Rutland <[email protected]> Cc: Marco Elver <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2019-07-12	mm/large system hash: clear hashdist when only one node with memory is booted	Nicholas Piggin	1	-13/+18
	CONFIG_NUMA on 64-bit CPUs currently enables hashdist unconditionally even when booting on single node machines. This causes the large system hashes to be allocated with vmalloc, and mapped with small pages. This change clears hashdist if only one node has come up with memory. This results in the important large inode and dentry hashes using memblock allocations. All others are within 4MB size up to about 128GB of RAM, which allows them to be allocated from the linear map on most non-NUMA images. Other big hashes like futex and TCP should eventually be moved over to the same style of allocation as those vfs caches that use HASH_EARLY if !hashdist, so they don't exceed MAX_ORDER on very large non-NUMA images. This brings dTLB misses for linux kernel tree `git diff` from ~45,000 to ~8,000 on a Kaby Lake KVM guest with 8MB dentry hash and mitigations=off (performance is in the noise, under 1% difference, page tables are likely to be well cached for this workload). Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Nicholas Piggin <[email protected]> Reviewed-by: Andrew Morton <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2019-07-12	mm/large system hash: use vmalloc for size > MAX_ORDER when !hashdist	Nicholas Piggin	1	-7/+9
	The kernel currently clamps large system hashes to MAX_ORDER when hashdist is not set, which is rather arbitrary. vmalloc space is limited on 32-bit machines, but this shouldn't result in much more used because of small physical memory limiting system hash sizes. Include "vmalloc" or "linear" in the kernel log message. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Nicholas Piggin <[email protected]> Reviewed-by: Andrew Morton <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2019-07-12	mm/vmalloc.c: spelling> s/informaion/information/	Geert Uytterhoeven	1	-2/+2
	Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Geert Uytterhoeven <[email protected]> Reviewed-by: Andrew Morton <[email protected]> Acked-by: Souptick Joarder <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2019-07-12	mm/vmalloc.c: switch to WARN_ON() and move it under unlink_va()	Uladzislau Rezki (Sony)	1	-15/+10
	Trigger a warning if an object that is about to be freed is detached. We used to have a BUG_ON(), but even though it is considered as faulty behaviour that is not a good reason to break a system. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Uladzislau Rezki (Sony) <[email protected]> Cc: Roman Gushchin <[email protected]> Cc: Hillf Danton <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Matthew Wilcox <[email protected]> Cc: Oleksiy Avramchenko <[email protected]> Cc: Steven Rostedt <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2019-07-12	mm/vmalloc.c: get rid of one single unlink_va() when merge	Uladzislau Rezki (Sony)	1	-6/+2
	It does not make sense to try to "unlink" the node that is definitely not linked with a list nor tree. On the first merge step VA just points to the previously disconnected busy area. On the second step, check if the node has been merged and do "unlink" if so, because now it points to an object that must be linked. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Uladzislau Rezki (Sony) <[email protected]> Acked-by: Hillf Danton <[email protected]> Reviewed-by: Roman Gushchin <[email protected]> Cc: Matthew Wilcox <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Oleksiy Avramchenko <[email protected]> Cc: Steven Rostedt <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2019-07-12	mm/vmalloc.c: preload a CPU with one object for split purpose	Uladzislau Rezki (Sony)	1	-4/+51
	Refactor the NE_FIT_TYPE split case when it comes to an allocation of one extra object. We need it in order to build a remaining space. The preload is done per CPU in non-atomic context with GFP_KERNEL flags. More permissive parameters can be beneficial for systems which are suffer from high memory pressure or low memory condition. For example on my KVM system(4xCPUs, no swap, 256MB RAM) i can simulate the failure of page allocation with GFP_NOWAIT flags. Using "stress-ng" tool and starting N workers spinning on fork() and exit(), i can trigger below trace: <snip> [ 179.815161] stress-ng-fork: page allocation failure: order:0, mode:0x40800(GFP_NOWAIT\|__GFP_COMP), nodemask=(null),cpuset=/,mems_allowed=0 [ 179.815168] CPU: 0 PID: 12612 Comm: stress-ng-fork Not tainted 5.2.0-rc3+ #1003 [ 179.815170] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1 04/01/2014 [ 179.815171] Call Trace: [ 179.815178] dump_stack+0x5c/0x7b [ 179.815182] warn_alloc+0x108/0x190 [ 179.815187] __alloc_pages_slowpath+0xdc7/0xdf0 [ 179.815191] __alloc_pages_nodemask+0x2de/0x330 [ 179.815194] cache_grow_begin+0x77/0x420 [ 179.815197] fallback_alloc+0x161/0x200 [ 179.815200] kmem_cache_alloc+0x1c9/0x570 [ 179.815202] alloc_vmap_area+0x32c/0x990 [ 179.815206] __get_vm_area_node+0xb0/0x170 [ 179.815208] __vmalloc_node_range+0x6d/0x230 [ 179.815211] ? _do_fork+0xce/0x3d0 [ 179.815213] copy_process.part.46+0x850/0x1b90 [ 179.815215] ? _do_fork+0xce/0x3d0 [ 179.815219] _do_fork+0xce/0x3d0 [ 179.815226] ? __do_page_fault+0x2bf/0x4e0 [ 179.815229] do_syscall_64+0x55/0x130 [ 179.815231] entry_SYSCALL_64_after_hwframe+0x44/0xa9 [ 179.815234] RIP: 0033:0x7fedec4c738b ... [ 179.815237] RSP: 002b:00007ffda469d730 EFLAGS: 00000246 ORIG_RAX: 0000000000000038 [ 179.815239] RAX: ffffffffffffffda RBX: 00007ffda469d730 RCX: 00007fedec4c738b [ 179.815240] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000001200011 [ 179.815241] RBP: 00007ffda469d780 R08: 00007fededd6e300 R09: 00007ffda47f50a0 [ 179.815242] R10: 00007fededd6e5d0 R11: 0000000000000246 R12: 0000000000000000 [ 179.815243] R13: 0000000000000020 R14: 0000000000000000 R15: 0000000000000000 [ 179.815245] Mem-Info: [ 179.815249] active_anon:12686 inactive_anon:14760 isolated_anon:0 active_file:502 inactive_file:61 isolated_file:70 unevictable:2 dirty:0 writeback:0 unstable:0 slab_reclaimable:2380 slab_unreclaimable:7520 mapped:15069 shmem:14813 pagetables:10833 bounce:0 free:1922 free_pcp:229 free_cma:0 <snip> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Uladzislau Rezki (Sony) <[email protected]> Cc: Hillf Danton <[email protected]> Cc: Matthew Wilcox <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Oleksiy Avramchenko <[email protected]> Cc: Roman Gushchin <[email protected]> Cc: Steven Rostedt <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2019-07-12	mm/vmalloc.c: remove "node" argument	Uladzislau Rezki (Sony)	1	-2/+2
	Patch series "Some cleanups for the KVA/vmalloc", v5. This patch (of 4): Remove unused argument from the __alloc_vmap_area() function. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Uladzislau Rezki (Sony) <[email protected]> Reviewed-by: Andrew Morton <[email protected]> Reviewed-by: Roman Gushchin <[email protected]> Cc: Hillf Danton <[email protected]> Cc: Matthew Wilcox <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Oleksiy Avramchenko <[email protected]> Cc: Steven Rostedt <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2019-07-12	mm/mmu_notifier: use hlist_add_head_rcu()	Jean-Philippe Brucker	1	-1/+1
	Make mmu_notifier_register() safer by issuing a memory barrier before registering a new notifier. This fixes a theoretical bug on weakly ordered CPUs. For example, take this simplified use of notifiers by a driver: my_struct->mn.ops = &my_ops; /* (1) / mmu_notifier_register(&my_struct->mn, mm) ... hlist_add_head(&mn->hlist, &mm->mmu_notifiers); / (2) */ ... Once mmu_notifier_register() releases the mm locks, another thread can invalidate a range: mmu_notifier_invalidate_range() ... hlist_for_each_entry_rcu(mn, &mm->mmu_notifiers, hlist) { if (mn->ops->invalidate_range) The read side relies on the data dependency between mn and ops to ensure that the pointer is properly initialized. But the write side doesn't have any dependency between (1) and (2), so they could be reordered and the readers could dereference an invalid mn->ops. mmu_notifier_register() does take all the mm locks before adding to the hlist, but those have acquire semantics which isn't sufficient. By calling hlist_add_head_rcu() instead of hlist_add_head() we update the hlist using a store-release, ensuring that readers see prior initialization of my_struct. This situation is better illustated by litmus test MP+onceassign+derefonce. Link: http://lkml.kernel.org/r/[email protected] Fixes: cddb8a5c14aa ("mmu-notifiers: core") Signed-off-by: Jean-Philippe Brucker <[email protected]> Cc: Jérôme Glisse <[email protected]> Cc: Michal Hocko <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2019-07-12	mm/memory.c: fail when offset == num in first check of __vm_map_pages()	Miguel Ojeda	1	-1/+1
	If the caller asks us for offset == num, we should already fail in the first check, i.e. the one testing for offsets beyond the object. At the moment, we are failing on the second test anyway, since count cannot be 0. Still, to agree with the comment of the first test, we should first test it there. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Miguel Ojeda <[email protected]> Reviewed-by: Andrew Morton <[email protected]> Cc: Souptick Joarder <[email protected]> Cc: Matthew Wilcox <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: "Aneesh Kumar K.V" <[email protected]> Cc: Huang Ying <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2019-07-12	mm/pgtable: drop pgtable_t variable from pte_fn_t functions	Anshuman Khandual	2	-5/+2
	Drop the pgtable_t variable from all implementation for pte_fn_t as none of them use it. apply_to_pte_range() should stop computing it as well. Should help us save some cycles. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Anshuman Khandual <[email protected]> Acked-by: Matthew Wilcox <[email protected]> Cc: Ard Biesheuvel <[email protected]> Cc: Russell King <[email protected]> Cc: Catalin Marinas <[email protected]> Cc: Will Deacon <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Logan Gunthorpe <[email protected]> Cc: "Kirill A. Shutemov" <[email protected]> Cc: Dan Williams <[email protected]> Cc: <[email protected]> Cc: Mike Rapoport <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2019-07-12	mm/gup.c: mark undo_dev_pagemap as __maybe_unused	Guenter Roeck	1	-1/+2
	Several mips builds generate the following build warning. mm/gup.c:1788:13: warning: 'undo_dev_pagemap' defined but not used The function is declared unconditionally but only called from behind various ifdefs. Mark it __maybe_unused. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Guenter Roeck <[email protected]> Reviewed-by: Andrew Morton <[email protected]> Cc: Stephen Rothwell <[email protected]> Cc: Robin Murphy <[email protected]> Cc: Kirill A. Shutemov <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2019-07-12	mm/gup.c: remove some BUG_ONs from get_gate_page()	Andy Lutomirski	1	-3/+6
	If we end up without a PGD or PUD entry backing the gate area, don't BUG -- just fail gracefully. It's not entirely implausible that this could happen some day on x86. It doesn't right now even with an execute-only emulated vsyscall page because the fixmap shares the PUD, but the core mm code shouldn't rely on that particular detail to avoid OOPSing. Link: http://lkml.kernel.org/r/a1d9f4efb75b9d464e59fd6af00104b21c58f6f7.1561610798.git.luto@kernel.org Signed-off-by: Andy Lutomirski <[email protected]> Reviewed-by: Kees Cook <[email protected]> Reviewed-by: Andrew Morton <[email protected]> Cc: Florian Weimer <[email protected]> Cc: Jann Horn <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2019-07-12	mm/gup: speed up check_and_migrate_cma_pages() on huge page	Pingfan Liu	1	-8/+16
	Both hugetlb and thp locate on the same migration type of pageblock, since they are allocated from a free_list[]. Based on this fact, it is enough to check on a single subpage to decide the migration type of the whole huge page. By this way, it saves (2M/4K - 1) times loop for pmd_huge on x86, similar on other archs. Furthermore, when executing isolate_huge_page(), it avoid taking global hugetlb_lock many times, and meanless remove/add to the local link list cma_page_list. [[email protected]: make `i' and `step' unsigned] Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Pingfan Liu <[email protected]> Reviewed-by: Andrew Morton <[email protected]> Reviewed-by: Ira Weiny <[email protected]> Cc: Mike Rapoport <[email protected]> Cc: "Kirill A. Shutemov" <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: John Hubbard <[email protected]> Cc: "Aneesh Kumar K.V" <[email protected]> Cc: Christoph Hellwig <[email protected]> Cc: Keith Busch <[email protected]> Cc: Mike Kravetz <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2019-07-12	mm: mark the page referenced in gup_hugepte	Christoph Hellwig	1	-0/+1
	All other get_user_page_fast cases mark the page referenced, so do this here as well. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Christoph Hellwig <[email protected]> Cc: Andrey Konovalov <[email protected]> Cc: Benjamin Herrenschmidt <[email protected]> Cc: David Miller <[email protected]> Cc: James Hogan <[email protected]> Cc: Jason Gunthorpe <[email protected]> Cc: Khalid Aziz <[email protected]> Cc: Michael Ellerman <[email protected]> Cc: Nicholas Piggin <[email protected]> Cc: Paul Burton <[email protected]> Cc: Paul Mackerras <[email protected]> Cc: Ralf Baechle <[email protected]> Cc: Rich Felker <[email protected]> Cc: Yoshinori Sato <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2019-07-12	mm: switch gup_hugepte to use try_get_compound_head	Christoph Hellwig	1	-1/+2
	This applies the overflow fixes from 8fde12ca79aff ("mm: prevent get_user_pages() from overflowing page refcount") to the powerpc hugepd code and brings it back in sync with the other GUP cases. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Christoph Hellwig <[email protected]> Cc: Andrey Konovalov <[email protected]> Cc: Benjamin Herrenschmidt <[email protected]> Cc: David Miller <[email protected]> Cc: James Hogan <[email protected]> Cc: Jason Gunthorpe <[email protected]> Cc: Khalid Aziz <[email protected]> Cc: Michael Ellerman <[email protected]> Cc: Nicholas Piggin <[email protected]> Cc: Paul Burton <[email protected]> Cc: Paul Mackerras <[email protected]> Cc: Ralf Baechle <[email protected]> Cc: Rich Felker <[email protected]> Cc: Yoshinori Sato <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2019-07-12	mm: move the powerpc hugepd code to mm/gup.c	Christoph Hellwig	2	-0/+92
	While only powerpc supports the hugepd case, the code is pretty generic and I'd like to keep all GUP internals in one place. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Christoph Hellwig <[email protected]> Cc: Andrey Konovalov <[email protected]> Cc: Benjamin Herrenschmidt <[email protected]> Cc: David Miller <[email protected]> Cc: James Hogan <[email protected]> Cc: Jason Gunthorpe <[email protected]> Cc: Khalid Aziz <[email protected]> Cc: Michael Ellerman <[email protected]> Cc: Nicholas Piggin <[email protected]> Cc: Paul Burton <[email protected]> Cc: Paul Mackerras <[email protected]> Cc: Ralf Baechle <[email protected]> Cc: Rich Felker <[email protected]> Cc: Yoshinori Sato <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2019-07-12	mm: validate get_user_pages_fast flags	Christoph Hellwig	1	-0/+3
	We can only deal with FOLL_WRITE and/or FOLL_LONGTERM in get_user_pages_fast, so reject all other flags. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Christoph Hellwig <[email protected]> Cc: Andrey Konovalov <[email protected]> Cc: Benjamin Herrenschmidt <[email protected]> Cc: David Miller <[email protected]> Cc: James Hogan <[email protected]> Cc: Jason Gunthorpe <[email protected]> Cc: Khalid Aziz <[email protected]> Cc: Michael Ellerman <[email protected]> Cc: Nicholas Piggin <[email protected]> Cc: Paul Burton <[email protected]> Cc: Paul Mackerras <[email protected]> Cc: Ralf Baechle <[email protected]> Cc: Rich Felker <[email protected]> Cc: Yoshinori Sato <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2019-07-12	mm: consolidate the get_user_pages* implementations	Christoph Hellwig	5	-142/+65
	Always build mm/gup.c so that we don't have to provide separate nommu stubs. Also merge the get_user_pages_fast and __get_user_pages_fast stubs when HAVE_FAST_GUP into the main implementations, which will never call the fast path if HAVE_FAST_GUP is not set. This also ensures the new put_user_pages* helpers are available for nommu, as those are currently missing, which would create a problem as soon as we actually grew users for it. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Christoph Hellwig <[email protected]> Cc: Andrey Konovalov <[email protected]> Cc: Benjamin Herrenschmidt <[email protected]> Cc: David Miller <[email protected]> Cc: James Hogan <[email protected]> Cc: Jason Gunthorpe <[email protected]> Cc: Khalid Aziz <[email protected]> Cc: Michael Ellerman <[email protected]> Cc: Nicholas Piggin <[email protected]> Cc: Paul Burton <[email protected]> Cc: Paul Mackerras <[email protected]> Cc: Ralf Baechle <[email protected]> Cc: Rich Felker <[email protected]> Cc: Yoshinori Sato <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2019-07-12	mm: reorder code blocks in gup.c	Christoph Hellwig	1	-205/+205
	This moves the actually exported functions towards the end of the file, and reorders some functions to be in more logical blocks as a preparation for moving various stubs inline into the main functionality using IS_ENABLED(). Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Christoph Hellwig <[email protected]> Cc: Andrey Konovalov <[email protected]> Cc: Benjamin Herrenschmidt <[email protected]> Cc: David Miller <[email protected]> Cc: James Hogan <[email protected]> Cc: Jason Gunthorpe <[email protected]> Cc: Khalid Aziz <[email protected]> Cc: Michael Ellerman <[email protected]> Cc: Nicholas Piggin <[email protected]> Cc: Paul Burton <[email protected]> Cc: Paul Mackerras <[email protected]> Cc: Ralf Baechle <[email protected]> Cc: Rich Felker <[email protected]> Cc: Yoshinori Sato <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2019-07-12	mm: rename CONFIG_HAVE_GENERIC_GUP to CONFIG_HAVE_FAST_GUP	Christoph Hellwig	2	-3/+3
	We only support the generic GUP now, so rename the config option to be more clear, and always use the mm/Kconfig definition of the symbol and select it from the arch Kconfigs. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Christoph Hellwig <[email protected]> Reviewed-by: Khalid Aziz <[email protected]> Reviewed-by: Jason Gunthorpe <[email protected]> Cc: Andrey Konovalov <[email protected]> Cc: Benjamin Herrenschmidt <[email protected]> Cc: David Miller <[email protected]> Cc: James Hogan <[email protected]> Cc: Michael Ellerman <[email protected]> Cc: Nicholas Piggin <[email protected]> Cc: Paul Burton <[email protected]> Cc: Paul Mackerras <[email protected]> Cc: Ralf Baechle <[email protected]> Cc: Rich Felker <[email protected]> Cc: Yoshinori Sato <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2019-07-12	mm: lift the x86_32 PAE version of gup_get_pte to common code	Christoph Hellwig	2	-4/+50
	The split low/high access is the only non-READ_ONCE version of gup_get_pte that did show up in the various arch implemenations. Lift it to common code and drop the ifdef based arch override. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Christoph Hellwig <[email protected]> Reviewed-by: Jason Gunthorpe <[email protected]> Cc: Andrey Konovalov <[email protected]> Cc: Benjamin Herrenschmidt <[email protected]> Cc: David Miller <[email protected]> Cc: James Hogan <[email protected]> Cc: Khalid Aziz <[email protected]> Cc: Michael Ellerman <[email protected]> Cc: Nicholas Piggin <[email protected]> Cc: Paul Burton <[email protected]> Cc: Paul Mackerras <[email protected]> Cc: Ralf Baechle <[email protected]> Cc: Rich Felker <[email protected]> Cc: Yoshinori Sato <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2019-07-12	mm: simplify gup_fast_permitted	Christoph Hellwig	1	-10/+7
	Pass in the already calculated end value instead of recomputing it, and leave the end > start check in the callers instead of duplicating them in the arch code. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Christoph Hellwig <[email protected]> Reviewed-by: Jason Gunthorpe <[email protected]> Cc: Andrey Konovalov <[email protected]> Cc: Benjamin Herrenschmidt <[email protected]> Cc: David Miller <[email protected]> Cc: James Hogan <[email protected]> Cc: Khalid Aziz <[email protected]> Cc: Michael Ellerman <[email protected]> Cc: Nicholas Piggin <[email protected]> Cc: Paul Burton <[email protected]> Cc: Paul Mackerras <[email protected]> Cc: Ralf Baechle <[email protected]> Cc: Rich Felker <[email protected]> Cc: Yoshinori Sato <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2019-07-12	mm: use untagged_addr() for get_user_pages_fast addresses	Christoph Hellwig	1	-2/+2
	Patch series "switch the remaining architectures to use generic GUP", v4. A series to switch mips, sh and sparc64 to use the generic GUP code so that we only have one codebase to touch for further improvements to this code. This patch (of 16): This will allow sparc64, or any future architecture with memory tagging to override its tags for get_user_pages and get_user_pages_fast. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Christoph Hellwig <[email protected]> Reviewed-by: Khalid Aziz <[email protected]> Reviewed-by: Jason Gunthorpe <[email protected]> Cc: Paul Burton <[email protected]> Cc: James Hogan <[email protected]> Cc: Yoshinori Sato <[email protected]> Cc: Rich Felker <[email protected]> Cc: David Miller <[email protected]> Cc: Nicholas Piggin <[email protected]> Cc: Khalid Aziz <[email protected]> Cc: Andrey Konovalov <[email protected]> Cc: Benjamin Herrenschmidt <[email protected]> Cc: Paul Mackerras <[email protected]> Cc: Michael Ellerman <[email protected]> Cc: Ralf Baechle <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2019-07-12	mm, memcg: add a memcg_slabinfo debugfs file	Waiman Long	1	-0/+60
	There are concerns about memory leaks from extensive use of memory cgroups as each memory cgroup creates its own set of kmem caches. There is a possiblity that the memcg kmem caches may remain even after the memory cgroups have been offlined. Therefore, it will be useful to show the status of each of memcg kmem caches. This patch introduces a new <debugfs>/memcg_slabinfo file which is somewhat similar to /proc/slabinfo in format, but lists only information about kmem caches that have child memcg kmem caches. Information available in /proc/slabinfo are not repeated in memcg_slabinfo. A portion of a sample output of the file was: # <name> <css_id[:dead]> <active_objs> <num_objs> <active_slabs> <num_slabs> rpc_inode_cache root 13 51 1 1 rpc_inode_cache 48 0 0 0 0 fat_inode_cache root 1 45 1 1 fat_inode_cache 41 2 45 1 1 xfs_inode root 770 816 24 24 xfs_inode 92 22 34 1 1 xfs_inode 88:dead 1 34 1 1 xfs_inode 89:dead 23 34 1 1 xfs_inode 85 4 34 1 1 xfs_inode 84 9 34 1 1 The css id of the memcg is also listed. If a memcg is not online, the tag ":dead" will be attached as shown above. [[email protected]: memcg: add ":deact" tag for reparented kmem caches in memcg_slabinfo] Link: http://lkml.kernel.org/r/[email protected] [[email protected]: set the flag in the common code as suggested by Roman] Link: http://lkml.kernel.org/r/[email protected] Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Waiman Long <[email protected]> Suggested-by: Shakeel Butt <[email protected]> Reviewed-by: Shakeel Butt <[email protected]> Acked-by: Roman Gushchin <[email protected]> Acked-by: David Rientjes <[email protected]> Cc: Christoph Lameter <[email protected]> Cc: Pekka Enberg <[email protected]> Cc: Joonsoo Kim <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Vladimir Davydov <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2019-07-12	mm: memcg/slab: reparent memcg kmem_caches on cgroup removal	Roman Gushchin	3	-17/+57
	Let's reparent non-root kmem_caches on memcg offlining. This allows us to release the memory cgroup without waiting for the last outstanding kernel object (e.g. dentry used by another application). Since the parent cgroup is already charged, everything we need to do is to splice the list of kmem_caches to the parent's kmem_caches list, swap the memcg pointer, drop the css refcounter for each kmem_cache and adjust the parent's css refcounter. Please, note that kmem_cache->memcg_params.memcg isn't a stable pointer anymore. It's safe to read it under rcu_read_lock(), cgroup_mutex held, or any other way that protects the memory cgroup from being released. We can race with the slab allocation and deallocation paths. It's not a big problem: parent's charge and slab global stats are always correct, and we don't care anymore about the child usage and global stats. The child cgroup is already offline, so we don't use or show it anywhere. Local slab stats (NR_SLAB_RECLAIMABLE and NR_SLAB_UNRECLAIMABLE) aren't used anywhere except count_shadow_nodes(). But even there it won't break anything: after reparenting "nodes" will be 0 on child level (because we're already reparenting shrinker lists), and on parent level page stats always were 0, and this patch won't change anything. [[email protected]: properly handle kmem_caches reparented to root_mem_cgroup] Link: http://lkml.kernel.org/r/[email protected] Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Roman Gushchin <[email protected]> Acked-by: Vladimir Davydov <[email protected]> Reviewed-by: Shakeel Butt <[email protected]> Acked-by: David Rientjes <[email protected]> Cc: Christoph Lameter <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Waiman Long <[email protected]> Cc: David Rientjes <[email protected]> Cc: Joonsoo Kim <[email protected]> Cc: Pekka Enberg <[email protected]> Cc: Andrei Vagin <[email protected]> Cc: Qian Cai <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2019-07-12	mm: memcg/slab: stop setting page->mem_cgroup pointer for slab pages	Roman Gushchin	3	-19/+70
	Every slab page charged to a non-root memory cgroup has a pointer to the memory cgroup and holds a reference to it, which protects a non-empty memory cgroup from being released. At the same time the page has a pointer to the corresponding kmem_cache, and also hold a reference to the kmem_cache. And kmem_cache by itself holds a reference to the cgroup. So there is clearly some redundancy, which allows to stop setting the page->mem_cgroup pointer and rely on getting memcg pointer indirectly via kmem_cache. Further it will allow to change this pointer easier, without a need to go over all charged pages. So let's stop setting page->mem_cgroup pointer for slab pages, and stop using the css refcounter directly for protecting the memory cgroup from going away. Instead rely on kmem_cache as an intermediate object. Make sure that vmstats and shrinker lists are working as previously, as well as /proc/kpagecgroup interface. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Roman Gushchin <[email protected]> Acked-by: Vladimir Davydov <[email protected]> Reviewed-by: Shakeel Butt <[email protected]> Cc: Christoph Lameter <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Waiman Long <[email protected]> Cc: David Rientjes <[email protected]> Cc: Joonsoo Kim <[email protected]> Cc: Pekka Enberg <[email protected]> Cc: Andrei Vagin <[email protected]> Cc: Qian Cai <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2019-07-12	mm: memcg/slab: rework non-root kmem_cache lifecycle management	Roman Gushchin	3	-78/+94
	Currently each charged slab page holds a reference to the cgroup to which it's charged. Kmem_caches are held by the memcg and are released all together with the memory cgroup. It means that none of kmem_caches are released unless at least one reference to the memcg exists, which is very far from optimal. Let's rework it in a way that allows releasing individual kmem_caches as soon as the cgroup is offline, the kmem_cache is empty and there are no pending allocations. To make it possible, let's introduce a new percpu refcounter for non-root kmem caches. The counter is initialized to the percpu mode, and is switched to the atomic mode during kmem_cache deactivation. The counter is bumped for every charged page and also for every running allocation. So the kmem_cache can't be released unless all allocations complete. To shutdown non-active empty kmem_caches, let's reuse the work queue, previously used for the kmem_cache deactivation. Once the reference counter reaches 0, let's schedule an asynchronous kmem_cache release. * I used the following simple approach to test the performance (stolen from another patchset by T. Harding): time find / -name fname-no-exist echo 2 > /proc/sys/vm/drop_caches repeat 10 times Results: orig patched real 0m1.455s real 0m1.355s user 0m0.206s user 0m0.219s sys 0m0.855s sys 0m0.807s real 0m1.487s real 0m1.699s user 0m0.221s user 0m0.256s sys 0m0.806s sys 0m0.948s real 0m1.515s real 0m1.505s user 0m0.183s user 0m0.215s sys 0m0.876s sys 0m0.858s real 0m1.291s real 0m1.380s user 0m0.193s user 0m0.198s sys 0m0.843s sys 0m0.786s real 0m1.364s real 0m1.374s user 0m0.180s user 0m0.182s sys 0m0.868s sys 0m0.806s real 0m1.352s real 0m1.312s user 0m0.201s user 0m0.212s sys 0m0.820s sys 0m0.761s real 0m1.302s real 0m1.349s user 0m0.205s user 0m0.203s sys 0m0.803s sys 0m0.792s real 0m1.334s real 0m1.301s user 0m0.194s user 0m0.201s sys 0m0.806s sys 0m0.779s real 0m1.426s real 0m1.434s user 0m0.216s user 0m0.181s sys 0m0.824s sys 0m0.864s real 0m1.350s real 0m1.295s user 0m0.200s user 0m0.190s sys 0m0.842s sys 0m0.811s So it looks like the difference is not noticeable in this test. [[email protected]: fix an use-after-free in kmemcg_workfn()] Link: http://lkml.kernel.org/r/[email protected] Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Roman Gushchin <[email protected]> Signed-off-by: Qian Cai <[email protected]> Acked-by: Vladimir Davydov <[email protected]> Cc: Christoph Lameter <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Shakeel Butt <[email protected]> Cc: Waiman Long <[email protected]> Cc: David Rientjes <[email protected]> Cc: Joonsoo Kim <[email protected]> Cc: Pekka Enberg <[email protected]> Cc: Andrei Vagin <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2019-07-12	mm: memcg/slab: synchronize access to kmem_cache dying flag using a spinlock	Roman Gushchin	1	-3/+12
	Currently the memcg_params.dying flag and the corresponding workqueue used for the asynchronous deactivation of kmem_caches is synchronized using the slab_mutex. It makes impossible to check this flag from the irq context, which will be required in order to implement asynchronous release of kmem_caches. So let's switch over to the irq-save flavor of the spinlock-based synchronization. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Roman Gushchin <[email protected]> Acked-by: Vladimir Davydov <[email protected]> Reviewed-by: Shakeel Butt <[email protected]> Cc: Christoph Lameter <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Waiman Long <[email protected]> Cc: David Rientjes <[email protected]> Cc: Joonsoo Kim <[email protected]> Cc: Pekka Enberg <[email protected]> Cc: Andrei Vagin <[email protected]> Cc: Qian Cai <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2019-07-12	mm: memcg/slab: don't check the dying flag on kmem_cache creation	Roman Gushchin	1	-1/+1
	There is no point in checking the root_cache->memcg_params.dying flag on kmem_cache creation path. New allocations shouldn't be performed using a dead root kmem_cache, so no new memcg kmem_cache creation can be scheduled after the flag is set. And if it was scheduled before, flush_memcg_workqueue() will wait for it anyway. So let's drop this check to simplify the code. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Roman Gushchin <[email protected]> Acked-by: Vladimir Davydov <[email protected]> Reviewed-by: Shakeel Butt <[email protected]> Cc: Christoph Lameter <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Waiman Long <[email protected]> Cc: David Rientjes <[email protected]> Cc: Joonsoo Kim <[email protected]> Cc: Pekka Enberg <[email protected]> Cc: Andrei Vagin <[email protected]> Cc: Qian Cai <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2019-07-12	mm: memcg/slab: unify SLAB and SLUB page accounting	Roman Gushchin	3	-28/+30
	Currently the page accounting code is duplicated in SLAB and SLUB internals. Let's move it into new (un)charge_slab_page helpers in the slab_common.c file. These helpers will be responsible for statistics (global and memcg-aware) and memcg charging. So they are replacing direct memcg_(un)charge_slab() calls. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Roman Gushchin <[email protected]> Reviewed-by: Shakeel Butt <[email protected]> Acked-by: Christoph Lameter <[email protected]> Acked-by: Vladimir Davydov <[email protected]> Acked-by: Johannes Weiner <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Waiman Long <[email protected]> Cc: David Rientjes <[email protected]> Cc: Joonsoo Kim <[email protected]> Cc: Pekka Enberg <[email protected]> Cc: Andrei Vagin <[email protected]> Cc: Qian Cai <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2019-07-12	mm: memcg/slab: introduce __memcg_kmem_uncharge_memcg()	Roman Gushchin	1	-8/+17
	Let's separate the page counter modification code out of __memcg_kmem_uncharge() in a way similar to what __memcg_kmem_charge() and __memcg_kmem_charge_memcg() work. This will allow to reuse this code later using a new memcg_kmem_uncharge_memcg() wrapper, which calls __memcg_kmem_uncharge_memcg() if memcg_kmem_enabled() check is passed. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Roman Gushchin <[email protected]> Reviewed-by: Shakeel Butt <[email protected]> Acked-by: Vladimir Davydov <[email protected]> Cc: Christoph Lameter <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Waiman Long <[email protected]> Cc: David Rientjes <[email protected]> Cc: Joonsoo Kim <[email protected]> Cc: Pekka Enberg <[email protected]> Cc: Andrei Vagin <[email protected]> Cc: Qian Cai <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2019-07-12	mm: memcg/slab: generalize postponed non-root kmem_cache deactivation	Roman Gushchin	4	-28/+14
	Currently SLUB uses a work scheduled after an RCU grace period to deactivate a non-root kmem_cache. This mechanism can be reused for kmem_caches release, but requires generalization for SLAB case. Introduce kmemcg_cache_deactivate() function, which calls allocator-specific __kmem_cache_deactivate() and schedules execution of __kmem_cache_deactivate_after_rcu() with all necessary locks in a worker context after an rcu grace period. Here is the new calling scheme: kmemcg_cache_deactivate() __kmemcg_cache_deactivate() SLAB/SLUB-specific kmemcg_rcufn() rcu kmemcg_workfn() work __kmemcg_cache_deactivate_after_rcu() SLAB/SLUB-specific instead of: __kmemcg_cache_deactivate() SLAB/SLUB-specific slab_deactivate_memcg_cache_rcu_sched() SLUB-only kmemcg_rcufn() rcu kmemcg_workfn() work kmemcg_cache_deact_after_rcu() SLUB-only For consistency, all allocator-specific functions start with "__". Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Roman Gushchin <[email protected]> Acked-by: Vladimir Davydov <[email protected]> Reviewed-by: Shakeel Butt <[email protected]> Cc: Christoph Lameter <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Waiman Long <[email protected]> Cc: David Rientjes <[email protected]> Cc: Joonsoo Kim <[email protected]> Cc: Pekka Enberg <[email protected]> Cc: Andrei Vagin <[email protected]> Cc: Qian Cai <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2019-07-12	mm: memcg/slab: rename slab delayed deactivation functions and fields	Roman Gushchin	2	-16/+16
	The delayed work/rcu deactivation infrastructure of non-root kmem_caches can be also used for asynchronous release of these objects. Let's get rid of the word "deactivation" in corresponding names to make the code look better after generalization. It's easier to make the renaming first, so that the generalized code will look consistent from scratch. Let's rename struct memcg_cache_params fields: deact_fn -> work_fn deact_rcu_head -> rcu_head deact_work -> work And RCU/delayed work callbacks in slab common code: kmemcg_deactivate_rcufn -> kmemcg_rcufn kmemcg_deactivate_workfn -> kmemcg_workfn This patch contains no functional changes, only renamings. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Roman Gushchin <[email protected]> Acked-by: Vladimir Davydov <[email protected]> Reviewed-by: Shakeel Butt <[email protected]> Cc: Christoph Lameter <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Waiman Long <[email protected]> Cc: David Rientjes <[email protected]> Cc: Joonsoo Kim <[email protected]> Cc: Pekka Enberg <[email protected]> Cc: Andrei Vagin <[email protected]> Cc: Qian Cai <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2019-07-12	mm: memcg/slab: postpone kmem_cache memcg pointer initialization to ↵	Roman Gushchin	4	-11/+12
	memcg_link_cache() Patch series "mm: reparent slab memory on cgroup removal", v7. # Why do we need this? We've noticed that the number of dying cgroups is steadily growing on most of our hosts in production. The following investigation revealed an issue in the userspace memory reclaim code [1], accounting of kernel stacks [2], and also the main reason: slab objects. The underlying problem is quite simple: any page charged to a cgroup holds a reference to it, so the cgroup can't be reclaimed unless all charged pages are gone. If a slab object is actively used by other cgroups, it won't be reclaimed, and will prevent the origin cgroup from being reclaimed. Slab objects, and first of all vfs cache, is shared between cgroups, which are using the same underlying fs, and what's even more important, it's shared between multiple generations of the same workload. So if something is running periodically every time in a new cgroup (like how systemd works), we do accumulate multiple dying cgroups. Strictly speaking pagecache isn't different here, but there is a key difference: we disable protection and apply some extra pressure on LRUs of dying cgroups, and these LRUs contain all charged pages. My experiments show that with the disabled kernel memory accounting the number of dying cgroups stabilizes at a relatively small number (~100, depends on memory pressure and cgroup creation rate), and with kernel memory accounting it grows pretty steadily up to several thousands. Memory cgroups are quite complex and big objects (mostly due to percpu stats), so it leads to noticeable memory losses. Memory occupied by dying cgroups is measured in hundreds of megabytes. I've even seen a host with more than 100Gb of memory wasted for dying cgroups. It leads to a degradation of performance with the uptime, and generally limits the usage of cgroups. My previous attempt [3] to fix the problem by applying extra pressure on slab shrinker lists caused a regressions with xfs and ext4, and has been reverted [4]. The following attempts to find the right balance [5, 6] were not successful. So instead of trying to find a maybe non-existing balance, let's do reparent accounted slab caches to the parent cgroup on cgroup removal. # Implementation approach There is however a significant problem with reparenting of slab memory: there is no list of charged pages. Some of them are in shrinker lists, but not all. Introducing of a new list is really not an option. But fortunately there is a way forward: every slab page has a stable pointer to the corresponding kmem_cache. So the idea is to reparent kmem_caches instead of slab pages. It's actually simpler and cheaper, but requires some underlying changes: 1) Make kmem_caches to hold a single reference to the memory cgroup, instead of a separate reference per every slab page. 2) Stop setting page->mem_cgroup pointer for memcg slab pages and use page->kmem_cache->memcg indirection instead. It's used only on slab page release, so performance overhead shouldn't be a big issue. 3) Introduce a refcounter for non-root slab caches. It's required to be able to destroy kmem_caches when they become empty and release the associated memory cgroup. There is a bonus: currently we release all memcg kmem_caches all together with the memory cgroup itself. This patchset allows individual kmem_caches to be released as soon as they become inactive and free. Some additional implementation details are provided in corresponding commit messages. # Results Below is the average number of dying cgroups on two groups of our production hosts. They do run some sort of web frontend workload, the memory pressure is moderate. As we can see, with the kernel memory reparenting the number stabilizes in 60s range; however with the original version it grows almost linearly and doesn't show any signs of plateauing. The difference in slab and percpu usage between patched and unpatched versions also grows linearly. In 7 days it exceeded 200Mb. day 0 1 2 3 4 5 6 7 original 56 362 628 752 1070 1250 1490 1560 patched 23 46 51 55 60 57 67 69 mem diff(Mb) 22 74 123 152 164 182 214 241 # Links [1]: commit 68600f623d69 ("mm: don't miss the last page because of round-off error") [2]: commit 9b6f7e163cd0 ("mm: rework memcg kernel stack accounting") [3]: commit 172b06c32b94 ("mm: slowly shrink slabs with a relatively small number of objects") [4]: commit a9a238e83fbb ("Revert "mm: slowly shrink slabs with a relatively small number of objects") [5]: https://lkml.org/lkml/2019/1/28/1865 [6]: https://marc.info/?l=linux-mm&m=155064763626437&w=2 This patch (of 10): Initialize kmem_cache->memcg_params.memcg pointer in memcg_link_cache() rather than in init_memcg_params(). Once kmem_cache will hold a reference to the memory cgroup, it will simplify the refcounting. For non-root kmem_caches memcg_link_cache() is always called before the kmem_cache becomes visible to a user, so it's safe. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Roman Gushchin <[email protected]> Reviewed-by: Shakeel Butt <[email protected]> Acked-by: Vladimir Davydov <[email protected]> Acked-by: Johannes Weiner <[email protected]> Cc: Waiman Long <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Christoph Lameter <[email protected]> Cc: Pekka Enberg <[email protected]> Cc: David Rientjes <[email protected]> Cc: Joonsoo Kim <[email protected]> Cc: Andrei Vagin <[email protected]> Cc: Qian Cai <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2019-07-12	mm: memcontrol: dump memory.stat during cgroup OOM	Johannes Weiner	1	-132/+157
	The current cgroup OOM memory info dump doesn't include all the memory we are tracking, nor does it give insight into what the VM tried to do leading up to the OOM. All that useful info is in memory.stat. Furthermore, the recursive printing for every child cgroup can generate absurd amounts of data on the console for larger cgroup trees, and it's not like we provide a per-cgroup breakdown during global OOM kills. When an OOM kill is triggered, print one set of recursive memory.stat items at the level whose limit triggered the OOM condition. Example output: stress invoked oom-killer: gfp_mask=0x100cca(GFP_HIGHUSER_MOVABLE), order=0, oom_score_adj=0 CPU: 2 PID: 210 Comm: stress Not tainted 5.2.0-rc2-mm1-00247-g47d49835983c #135 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-20181126_142135-anatol 04/01/2014 Call Trace: dump_stack+0x46/0x60 dump_header+0x4c/0x2d0 oom_kill_process.cold.10+0xb/0x10 out_of_memory+0x200/0x270 ? try_to_free_mem_cgroup_pages+0xdf/0x130 mem_cgroup_out_of_memory+0xb7/0xc0 try_charge+0x680/0x6f0 mem_cgroup_try_charge+0xb5/0x160 __add_to_page_cache_locked+0xc6/0x300 ? list_lru_destroy+0x80/0x80 add_to_page_cache_lru+0x45/0xc0 pagecache_get_page+0x11b/0x290 filemap_fault+0x458/0x6d0 ext4_filemap_fault+0x27/0x36 __do_fault+0x2f/0xb0 __handle_mm_fault+0x9c5/0x1140 ? apic_timer_interrupt+0xa/0x20 handle_mm_fault+0xc5/0x180 __do_page_fault+0x1ab/0x440 ? page_fault+0x8/0x30 page_fault+0x1e/0x30 RIP: 0033:0x55c32167fc10 Code: Bad RIP value. RSP: 002b:00007fff1d031c50 EFLAGS: 00010206 RAX: 000000000dc00000 RBX: 00007fd2db000010 RCX: 00007fd2db000010 RDX: 0000000000000000 RSI: 0000000010001000 RDI: 0000000000000000 RBP: 000055c321680a54 R08: 00000000ffffffff R09: 0000000000000000 R10: 0000000000000022 R11: 0000000000000246 R12: ffffffffffffffff R13: 0000000000000002 R14: 0000000000001000 R15: 0000000010000000 memory: usage 1024kB, limit 1024kB, failcnt 75131 swap: usage 0kB, limit 9007199254740988kB, failcnt 0 Memory cgroup stats for /foo: anon 0 file 0 kernel_stack 36864 slab 274432 sock 0 shmem 0 file_mapped 0 file_dirty 0 file_writeback 0 anon_thp 0 inactive_anon 126976 active_anon 0 inactive_file 0 active_file 0 unevictable 0 slab_reclaimable 0 slab_unreclaimable 274432 pgfault 59466 pgmajfault 1617 workingset_refault 2145 workingset_activate 0 workingset_nodereclaim 0 pgrefill 98952 pgscan 200060 pgsteal 59340 pgactivate 40095 pgdeactivate 96787 pglazyfree 0 pglazyfreed 0 thp_fault_alloc 0 thp_collapse_alloc 0 Tasks state (memory values in pages): [ pid ] uid tgid total_vm rss pgtables_bytes swapents oom_score_adj name [ 200] 0 200 1121 884 53248 29 0 bash [ 209] 0 209 905 246 45056 19 0 stress [ 210] 0 210 66442 56 499712 56349 0 stress oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),oom_memcg=/foo,task_memcg=/foo,task=stress,pid=210,uid=0 Memory cgroup out of memory: Killed process 210 (stress) total-vm:265768kB, anon-rss:0kB, file-rss:224kB, shmem-rss:0kB oom_reaper: reaped process 210 (stress), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB [[email protected]: s/kvmalloc/kmalloc/ per Michal] Link: http://lkml.kernel.org/r/[email protected] Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Johannes Weiner <[email protected]> Acked-by: Michal Hocko <[email protected]> Cc: Vladimir Davydov <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2019-07-12	mm, memcg: introduce memory.events.local	Shakeel Butt	1	-10/+24
	The memory controller in cgroup v2 exposes memory.events file for each memcg which shows the number of times events like low, high, max, oom and oom_kill have happened for the whole tree rooted at that memcg. Users can also poll or register notification to monitor the changes in that file. Any event at any level of the tree rooted at memcg will notify all the listeners along the path till root_mem_cgroup. There are existing users which depend on this behavior. However there are users which are only interested in the events happening at a specific level of the memcg tree and not in the events in the underlying tree rooted at that memcg. One such use-case is a centralized resource monitor which can dynamically adjust the limits of the jobs running on a system. The jobs can create their sub-hierarchy for their own sub-tasks. The centralized monitor is only interested in the events at the top level memcgs of the jobs as it can then act and adjust the limits of the jobs. Using the current memory.events for such centralized monitor is very inconvenient. The monitor will keep receiving events which it is not interested and to find if the received event is interesting, it has to read memory.event files of the next level and compare it with the top level one. So, let's introduce memory.events.local to the memcg which shows and notify for the events at the memcg level. Now, does memory.stat and memory.pressure need their local versions. IMHO no due to the no internal process contraint of the cgroup v2. The memory.stat file of the top level memcg of a job shows the stats and vmevents of the whole tree. The local stats or vmevents of the top level memcg will only change if there is a process running in that memcg but v2 does not allow that. Similarly for memory.pressure there will not be any process in the internal nodes and thus no chance of local pressure. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Shakeel Butt <[email protected]> Reviewed-by: Roman Gushchin <[email protected]> Acked-by: Johannes Weiner <[email protected]> Acked-by: Michal Hocko <[email protected]> Cc: Vladimir Davydov <[email protected]> Cc: Chris Down <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2019-07-12	memcg, oom: no oom-kill for __GFP_RETRY_MAYFAIL	Shakeel Butt	1	-3/+1
	The documentation of __GFP_RETRY_MAYFAIL clearly mentioned that the OOM killer will not be triggered and indeed the page alloc does not invoke OOM killer for such allocations. However we do trigger memcg OOM killer for __GFP_RETRY_MAYFAIL. Fix that. This flag will used later to not trigger oom-killer in the charging path for fanotify and inotify event allocations. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Shakeel Butt <[email protected]> Acked-by: Michal Hocko <[email protected]> Cc: Amir Goldstein <[email protected]> Cc: Jan Kara <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Roman Gushchin <[email protected]> Cc: Vladimir Davydov <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2019-07-12	mm/mincore.c: fix race between swapoff and mincore	Huang Ying	1	-2/+10
	Via commit 4b3ef9daa4fc ("mm/swap: split swap cache into 64MB trunks"), after swapoff, the address_space associated with the swap device will be freed. So swap_address_space() users which touch the address_space need some kind of mechanism to prevent the address_space from being freed during accessing. When mincore processes an unmapped range for swapped shmem pages, it doesn't hold the lock to prevent swap device from being swapped off. So the following race is possible: CPU1 CPU2 do_mincore() swapoff() walk_page_range() mincore_unmapped_range() __mincore_unmapped_range mincore_page as = swap_address_space() ... exit_swap_address_space() ... kvfree(spaces) find_get_page(as) The address space may be accessed after being freed. To fix the race, get_swap_device()/put_swap_device() is used to enclose find_get_page() to check whether the swap entry is valid and prevent the swap device from being swapoff during accessing. Link: http://lkml.kernel.org/r/[email protected] Fixes: 4b3ef9daa4fc ("mm/swap: split swap cache into 64MB trunks") Signed-off-by: "Huang, Ying" <[email protected]> Reviewed-by: Andrew Morton <[email protected]> Acked-by: Michal Hocko <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Paul E. McKenney <[email protected]> Cc: Minchan Kim <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Tim Chen <[email protected]> Cc: Mel Gorman <[email protected]> Cc: Jérôme Glisse <[email protected]> Cc: Andrea Arcangeli <[email protected]> Cc: Yang Shi <[email protected]> Cc: David Rientjes <[email protected]> Cc: Rik van Riel <[email protected]> Cc: Jan Kara <[email protected]> Cc: Dave Jiang <[email protected]> Cc: Daniel Jordan <[email protected]> Cc: Andrea Parri <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2019-07-12	mm, swap: use rbtree for swap_extent	Aaron Lu	2	-63/+76
	swap_extent is used to map swap page offset to backing device's block offset. For a continuous block range, one swap_extent is used and all these swap_extents are managed in a linked list. These swap_extents are used by map_swap_entry() during swap's read and write path. To find out the backing device's block offset for a page offset, the swap_extent list will be traversed linearly, with curr_swap_extent being used as a cache to speed up the search. This works well as long as swap_extents are not huge or when the number of processes that access swap device are few, but when the swap device has many extents and there are a number of processes accessing the swap device concurrently, it can be a problem. On one of our servers, the disk's remaining size is tight: $df -h Filesystem Size Used Avail Use% Mounted on ... ... /dev/nvme0n1p1 1.8T 1.3T 504G 72% /home/t4 When creating a 80G swapfile there, there are as many as 84656 swap extents. The end result is, kernel spends abou 30% time in map_swap_entry() and swap throughput is only 70MB/s. As a comparison, when I used smaller sized swapfile, like 4G whose swap_extent dropped to 2000, swap throughput is back to 400-500MB/s and map_swap_entry() is about 3%. One downside of using rbtree for swap_extent is, 'struct rbtree' takes 24 bytes while 'struct list_head' takes 16 bytes, that's 8 bytes more for each swap_extent. For a swapfile that has 80k swap_extents, that means 625KiB more memory consumed. Test: Since it's not possible to reboot that server, I can not test this patch diretly there. Instead, I tested it on another server with NVMe disk. I created a 20G swapfile on an NVMe backed XFS fs. By default, the filesystem is quite clean and the created swapfile has only 2 extents. Testing vanilla and this patch shows no obvious performance difference when swapfile is not fragmented. To see the patch's effects, I used some tweaks to manually fragment the swapfile by breaking the extent at 1M boundary. This made the swapfile have 20K extents. nr_task=4 kernel swapout(KB/s) map_swap_entry(perf) swapin(KB/s) map_swap_entry(perf) vanilla 165191 90.77% 171798 90.21% patched 858993 +420% 2.16% 715827 +317% 0.77% nr_task=8 kernel swapout(KB/s) map_swap_entry(perf) swapin(KB/s) map_swap_entry(perf) vanilla 306783 92.19% 318145 87.76% patched 954437 +211% 2.35% 1073741 +237% 1.57% swapout: the throughput of swap out, in KB/s, higher is better 1st map_swap_entry: cpu cycles percent sampled by perf swapin: the throughput of swap in, in KB/s, higher is better. 2nd map_swap_entry: cpu cycles percent sampled by perf nr_task=1 doesn't show any difference, this is due to the curr_swap_extent can be effectively used to cache the correct swap extent for single task workload. [[email protected]: s/BUG_ON(1)/BUG()/] Link: http://lkml.kernel.org/r/20190523142404.GA181@aaronlu Signed-off-by: Aaron Lu <[email protected]> Cc: Huang Ying <[email protected]> Cc: Hugh Dickins <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2019-07-12	mm/swap_state.c: simplify total_swapcache_pages() with get_swap_device()	Huang Ying	1	-18/+15
	total_swapcache_pages() may race with swapper_spaces[] allocation and freeing. Previously, this is protected with a swapper_spaces[] specific RCU mechanism. To simplify the logic/code complexity, it is replaced with get/put_swap_device(). The code line number is reduced too. Although not so important, the swapoff() performance improves too because one synchronize_rcu() call during swapoff() is deleted. [[email protected]: fix bad swap file entry warning] Link: http://lkml.kernel.org/r/[email protected] Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: "Huang, Ying" <[email protected]> Reviewed-by: Andrew Morton <[email protected]> Tested-by: Mike Kravetz <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Paul E. McKenney <[email protected]> Cc: Minchan Kim <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Tim Chen <[email protected]> Cc: Mel Gorman <[email protected]> Cc: Jérôme Glisse <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Andrea Arcangeli <[email protected]> Cc: Yang Shi <[email protected]> Cc: David Rientjes <[email protected]> Cc: Rik van Riel <[email protected]> Cc: Jan Kara <[email protected]> Cc: Dave Jiang <[email protected]> Cc: Daniel Jordan <[email protected]> Cc: Andrea Parri <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2019-07-12	mm, swap: fix race between swapoff and some swap operations	Huang Ying	3	-36/+136
	When swapin is performed, after getting the swap entry information from the page table, system will swap in the swap entry, without any lock held to prevent the swap device from being swapoff. This may cause the race like below, CPU 1 CPU 2 ----- ----- do_swap_page swapin_readahead __read_swap_cache_async swapoff swapcache_prepare p->swap_map = NULL __swap_duplicate p->swap_map[?] /* !!! NULL pointer access */ Because swapoff is usually done when system shutdown only, the race may not hit many people in practice. But it is still a race need to be fixed. To fix the race, get_swap_device() is added to check whether the specified swap entry is valid in its swap device. If so, it will keep the swap entry valid via preventing the swap device from being swapoff, until put_swap_device() is called. Because swapoff() is very rare code path, to make the normal path runs as fast as possible, rcu_read_lock/unlock() and synchronize_rcu() instead of reference count is used to implement get/put_swap_device(). >From get_swap_device() to put_swap_device(), RCU reader side is locked, so synchronize_rcu() in swapoff() will wait until put_swap_device() is called. In addition to swap_map, cluster_info, etc. data structure in the struct swap_info_struct, the swap cache radix tree will be freed after swapoff, so this patch fixes the race between swap cache looking up and swapoff too. Races between some other swap cache usages and swapoff are fixed too via calling synchronize_rcu() between clearing PageSwapCache() and freeing swap cache data structure. Another possible method to fix this is to use preempt_off() + stop_machine() to prevent the swap device from being swapoff when its data structure is being accessed. The overhead in hot-path of both methods is similar. The advantages of RCU based method are, 1. stop_machine() may disturb the normal execution code path on other CPUs. 2. File cache uses RCU to protect its radix tree. If the similar mechanism is used for swap cache too, it is easier to share code between them. 3. RCU is used to protect swap cache in total_swapcache_pages() and exit_swap_address_space() already. The two mechanisms can be merged to simplify the logic. Link: http://lkml.kernel.org/r/[email protected] Fixes: 235b62176712 ("mm/swap: add cluster lock") Signed-off-by: "Huang, Ying" <[email protected]> Reviewed-by: Andrea Parri <[email protected]> Not-nacked-by: Hugh Dickins <[email protected]> Cc: Andrea Arcangeli <[email protected]> Cc: Paul E. McKenney <[email protected]> Cc: Daniel Jordan <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Minchan Kim <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Tim Chen <[email protected]> Cc: Mel Gorman <[email protected]> Cc: Jérôme Glisse <[email protected]> Cc: Yang Shi <[email protected]> Cc: David Rientjes <[email protected]> Cc: Rik van Riel <[email protected]> Cc: Jan Kara <[email protected]> Cc: Dave Jiang <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>