blaster4385/linux-IllusionX - Linux kernel with personal config changes for arch linux

Age	Commit message (Collapse)	Author	Files	Lines
2017-09-08	lib/bitmap.c: make bitmap_parselist() thread-safe and much faster	Yury Norov	1	-12/+6
	Current implementation of bitmap_parselist() uses a static variable to save local state while setting bits in the bitmap. It is obviously wrong if we assume execution in multiprocessor environment. Fortunately, it's possible to rewrite this portion of code to avoid using the static variable. It is also possible to set bits in the mask per-range with bitmap_set(), not per-bit, as it is implemented now, with set_bit(); which is way faster. The important side effect of this change is that setting bits in this function from now is not per-bit atomic and less memory-ordered. This is because set_bit() guarantees the order of memory accesses, while bitmap_set() does not. I think that it is the advantage of the new approach, because the bitmap_parselist() is intended to initialise bit arrays, and user should protect the whole bitmap during initialisation if needed. So protecting individual bits looks expensive and useless. Also, other range-oriented functions in lib/bitmap.c don't worry much about atomicity. With all that, setting 2k bits in map with the pattern like 0-2047:128/256 becomes ~50 times faster after applying the patch in my testing environment (arm64 hosted on qemu). The second patch of the series adds the test for bitmap_parselist(). It's not intended to cover all tricky cases, just to make sure that I didn't screw up during rework. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Yury Norov <[email protected]> Cc: Noam Camus <[email protected]> Cc: Rasmus Villemoes <[email protected]> Cc: Matthew Wilcox <[email protected]> Cc: Mauro Carvalho Chehab <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2017-09-08	lib: add test module for CONFIG_DEBUG_VIRTUAL	Florian Fainelli	3	-0/+61
	Add a test module that allows testing that CONFIG_DEBUG_VIRTUAL works correctly, at least that it can catch invalid calls to virt_to_phys() against the non-linear kernel virtual address map. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Florian Fainelli <[email protected]> Cc: "Luis R. Rodriguez" <[email protected]> Cc: Kees Cook <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2017-09-08	lib/hexdump.c: return -EINVAL in case of error in hex2bin()	Andy Shevchenko	1	-2/+3
	In some cases caller would like to use error code directly without shadowing. -EINVAL feels a rightful code to return in case of error in hex2bin(). Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Andy Shevchenko <[email protected]> Cc: Arnd Bergmann <[email protected]> Cc: Rasmus Villemoes <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2017-09-08	block/cfq: cache rightmost rb_node	Davidlohr Bueso	1	-5/+14
	For the same reasons we already cache the leftmost pointer, apply the same optimization for rb_last() calls. Users must explicitly do this as rb_root_cached only deals with the smallest node. [[email protected]: brain fart #1] Link: http://lkml.kernel.org/r/[email protected] Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Davidlohr Bueso <[email protected]> Cc: Jens Axboe <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2017-09-08	mem/memcg: cache rightmost node	Davidlohr Bueso	1	-5/+18
	Such that we can optimize __mem_cgroup_largest_soft_limit_node(). The only overhead is the extra footprint for the cached pointer, but this should not be an issue for mem_cgroup_tree_per_node. [[email protected]: brain fart #2] Link: http://lkml.kernel.org/r/[email protected] Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Davidlohr Bueso <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Vladimir Davydov <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2017-09-08	fs/epoll: use faster rb_first_cached()	Davidlohr Bueso	1	-14/+16
	... such that we can avoid the tree walks to get the node with the smallest key. Semantically the same, as the previously used rb_first(), but O(1). The main overhead is the extra footprint for the cached rb_node pointer, which should not matter for epoll. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Davidlohr Bueso <[email protected]> Acked-by: Peter Zijlstra (Intel) <[email protected]> Cc: Jan Kara <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2017-09-08	procfs: use faster rb_first_cached()	Davidlohr Bueso	4	-15/+17
	... such that we can avoid the tree walks to get the node with the smallest key. Semantically the same, as the previously used rb_first(), but O(1). The main overhead is the extra footprint for the cached rb_node pointer, which should not matter for procfs. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Davidlohr Bueso <[email protected]> Acked-by: Peter Zijlstra (Intel) <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2017-09-08	lib/interval-tree: correct comment wrt generic flavor	Davidlohr Bueso	1	-1/+1
	interval_tree.h _is_ the generic flavor. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Davidlohr Bueso <[email protected]> Acked-by: Peter Zijlstra (Intel) <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2017-09-08	lib/interval_tree: fast overlap detection	Davidlohr Bueso	33	-105/+145
	Allow interval trees to quickly check for overlaps to avoid unnecesary tree lookups in interval_tree_iter_first(). As of this patch, all interval tree flavors will require using a 'rb_root_cached' such that we can have the leftmost node easily available. While most users will make use of this feature, those with special functions (in addition to the generic insert, delete, search calls) will avoid using the cached option as they can do funky things with insertions -- for example, vma_interval_tree_insert_after(). [[email protected]: fix deadlock from typo vm_lock_anon_vma()] Link: http://lkml.kernel.org/r/[email protected] Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Davidlohr Bueso <[email protected]> Signed-off-by: Jérôme Glisse <[email protected]> Acked-by: Christian König <[email protected]> Acked-by: Peter Zijlstra (Intel) <[email protected]> Acked-by: Doug Ledford <[email protected]> Acked-by: Michael S. Tsirkin <[email protected]> Cc: David Airlie <[email protected]> Cc: Jason Wang <[email protected]> Cc: Christian Benvenuti <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2017-09-08	block/cfq: replace cfq_rb_root leftmost caching	Davidlohr Bueso	1	-50/+20
	... with the generic rbtree flavor instead. No changes in semantics whatsoever. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Davidlohr Bueso <[email protected]> Reviewed-by: Jan Kara <[email protected]> Acked-by: Peter Zijlstra (Intel) <[email protected]> Cc: Jens Axboe <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2017-09-08	locking/rtmutex: replace top-waiter and pi_waiters leftmost caching	Davidlohr Bueso	7	-44/+27
	... with the generic rbtree flavor instead. No changes in semantics whatsoever. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Davidlohr Bueso <[email protected]> Acked-by: Peter Zijlstra (Intel) <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2017-09-08	sched/deadline: replace earliest dl and rq leftmost caching	Davidlohr Bueso	2	-35/+21
	... with the generic rbtree flavor instead. No changes in semantics whatsoever. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Davidlohr Bueso <[email protected]> Acked-by: Peter Zijlstra (Intel) <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2017-09-08	sched/fair: replace cfs_rq->rb_leftmost	Davidlohr Bueso	3	-29/+15
	... with the generic rbtree flavor instead. No changes in semantics whatsoever. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Davidlohr Bueso <[email protected]> Acked-by: Peter Zijlstra (Intel) <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2017-09-08	lib/rbtree_test.c: support rb_root_cached	Davidlohr Bueso	1	-19/+137
	We can work with a single rb_root_cached root to test both cached and non-cached rbtrees. In addition, also add a test to measure latencies between rb_first and its fast counterpart. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Davidlohr Bueso <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2017-09-08	lib/rbtree_test.c: add (inorder) traversal test	Davidlohr Bueso	1	-2/+23
	This adds a second test for regular rb-tree testing in that there is no need to repeat it for the augmented flavor. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Davidlohr Bueso <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2017-09-08	lib/rbtree_test.c: make input module parameters	Davidlohr Bueso	1	-21/+34
	Allows for more flexible debugging. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Davidlohr Bueso <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2017-09-08	rbtree: add some additional comments for rebalancing cases	Davidlohr Bueso	1	-3/+5
	While overall the code is very nicely commented, it might not be immediately obvious from the diagrams what is going on. Add a very brief summary of each case. Opposite cases where the node is the left child are left untouched. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Davidlohr Bueso <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2017-09-08	rbtree: optimize root-check during rebalancing loop	Davidlohr Bueso	1	-7/+16
	The only times the nil-parent (root node) condition is true is when the node is the first in the tree, or after fixing rbtree rule #4 and the case 1 rebalancing made the node the root. Such conditions do not apply most of the time: (i) The common case in an rbtree is to have more than a single node, so this is only true for the first rb_insert(). (ii) While there is a chance only one first rotation is needed, cases where the node's uncle is black (cases 2,3) are more common as we can have the following scenarios during the rotation looping: case1 only, case1+1, case2+3, case1+2+3, case3 only, etc. This patch, therefore, adds an unlikely() optimization to this conditional. When profiling with CONFIG_PROFILE_ANNOTATED_BRANCHES, a kernel build shows that the incorrect rate is less than 15%, and for workloads that involve insert mostly trees overtime tend to have less than 2% incorrect rate. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Davidlohr Bueso <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2017-09-08	rbtree: cache leftmost node internally	Davidlohr Bueso	4	-8/+113
	Patch series "rbtree: Cache leftmost node internally", v4. A series to extending rbtrees to internally cache the leftmost node such that we can have fast overlap check optimization for all interval tree users[1]. The benefits of this series are that: (i) Unify users that do internal leftmost node caching. (ii) Optimize all interval tree users. (iii) Convert at least two new users (epoll and procfs) to the new interface. This patch (of 16): Red-black tree semantics imply that nodes with smaller or greater (or equal for duplicates) keys always be to the left and right, respectively. For the kernel this is extremely evident when considering our rb_first() semantics. Enabling lookups for the smallest node in the tree in O(1) can save a good chunk of cycles in not having to walk down the tree each time. To this end there are a few core users that explicitly do this, such as the scheduler and rtmutexes. There is also the desire for interval trees to have this optimization allowing faster overlap checking. This patch introduces a new 'struct rb_root_cached' which is just the root with a cached pointer to the leftmost node. The reason why the regular rb_root was not extended instead of adding a new structure was that this allows the user to have the choice between memory footprint and actual tree performance. The new wrappers on top of the regular rb_root calls are: - rb_first_cached(cached_root) -- which is a fast replacement for rb_first. - rb_insert_color_cached(node, cached_root, new) - rb_erase_cached(node, cached_root) In addition, augmented cached interfaces are also added for basic insertion and deletion operations; which becomes important for the interval tree changes. With the exception of the inserts, which adds a bool for updating the new leftmost, the interfaces are kept the same. To this end, porting rb users to the cached version becomes really trivial, and keeping current rbtree semantics for users that don't care about the optimization requires zero overhead. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Davidlohr Bueso <[email protected]> Reviewed-by: Jan Kara <[email protected]> Acked-by: Peter Zijlstra (Intel) <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2017-09-08	bitops: avoid integer overflow in GENMASK(_ULL)	Matthias Kaehlcke	1	-2/+3
	GENMASK(_ULL) performs a left-shift of ~0UL(L), which technically results in an integer overflow. clang raises a warning if the overflow occurs in a preprocessor expression. Clear the low-order bits through a substraction instead of the left-shift to avoid the overflow. (akpm: no change in .text size in my testing) Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Matthias Kaehlcke <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2017-09-08	include: warn for inconsistent endian config definition	Babu Moger	2	-0/+8
	We have seen some generic code use config parameter CONFIG_CPU_BIG_ENDIAN to decide the endianness. Here are the few examples. include/asm-generic/qrwlock.h drivers/of/base.c drivers/of/fdt.c drivers/tty/serial/earlycon.c drivers/tty/serial/serial_core.c Display warning if CPU_BIG_ENDIAN is not defined on big endian architecture and also warn if it defined on little endian architectures. Here is our original discussion https://lkml.org/lkml/2017/5/24/620 Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Babu Moger <[email protected]> Suggested-by: Arnd Bergmann <[email protected]> Acked-by: Geert Uytterhoeven <[email protected]> Cc: "James E.J. Bottomley" <[email protected]> Cc: Alexander Viro <[email protected]> Cc: David S. Miller <[email protected]> Cc: Greg KH <[email protected]> Cc: Helge Deller <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Jonas Bonn <[email protected]> Cc: Max Filippov <[email protected]> Cc: Michael Ellerman <[email protected]> (powerpc) Cc: Michal Simek <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Stafford Horne <[email protected]> Cc: Stefan Kristiansson <[email protected]> Cc: Yoshinori Sato <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2017-09-08	arch/microblaze: add choice for endianness and update Makefile	Babu Moger	2	-0/+18
	microblaze architectures can be configured for either little or big endian formats. Add a choice option for the user to select the correct endian format(default to big endian). Also update the Makefile so toolchain can compile for the format it is configured for. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Babu Moger <[email protected]> Signed-off-by: Arnd Bergmann <[email protected]> Cc: Michal Simek <[email protected]> Cc: "James E.J. Bottomley" <[email protected]> Cc: Alexander Viro <[email protected]> Cc: David S. Miller <[email protected]> Cc: Geert Uytterhoeven <[email protected]> Cc: Greg KH <[email protected]> Cc: Helge Deller <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Jonas Bonn <[email protected]> Cc: Max Filippov <[email protected]> Cc: Michael Ellerman <[email protected]> (powerpc) Cc: Peter Zijlstra <[email protected]> Cc: Stafford Horne <[email protected]> Cc: Stefan Kristiansson <[email protected]> Cc: Yoshinori Sato <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2017-09-08	arch: define CPU_BIG_ENDIAN for all fixed big endian archs	Babu Moger	6	-0/+18
	Patch series "Define CPU_BIG_ENDIAN or warn for inconsistencies", v3. While working on enabling queued rwlock on SPARC, found this following code in include/asm-generic/qrwlock.h which uses CONFIG_CPU_BIG_ENDIAN to clear a byte. static inline u8 __qrwlock_write_byte(struct qrwlock lock) { return (u8 )lock + 3 IS_BUILTIN(CONFIG_CPU_BIG_ENDIAN); } Problem is many of the fixed big endian architectures don't define CPU_BIG_ENDIAN and clears the wrong byte. Define CPU_BIG_ENDIAN for all the fixed big endian architecture to fix it. Also found few more references of this config parameter in drivers/of/base.c drivers/of/fdt.c drivers/tty/serial/earlycon.c drivers/tty/serial/serial_core.c Be aware that this may cause regressions if someone has worked-around problems in the above code already. Remove the work-around. Here is our original discussion https://lkml.org/lkml/2017/5/24/620 Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Babu Moger <[email protected]> Suggested-by: Arnd Bergmann <[email protected]> Acked-by: Geert Uytterhoeven <[email protected]> Acked-by: David S. Miller <[email protected]> Acked-by: Stafford Horne <[email protected]> Cc: Yoshinori Sato <[email protected]> Cc: Jonas Bonn <[email protected]> Cc: Stefan Kristiansson <[email protected]> Cc: "James E.J. Bottomley" <[email protected]> Cc: Helge Deller <[email protected]> Cc: Alexander Viro <[email protected]> Cc: Michal Simek <[email protected]> Cc: Michael Ellerman <[email protected]> (powerpc) Cc: Peter Zijlstra <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Max Filippov <[email protected]> Cc: Greg KH <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2017-09-08	treewide: make "nr_cpu_ids" unsigned	Alexey Dobriyan	17	-21/+21
	First, number of CPUs can't be negative number. Second, different signnnedness leads to suboptimal code in the following cases: 1) kmalloc(nr_cpu_ids * sizeof(X)); "int" has to be sign extended to size_t. 2) while (loff_t *pos < nr_cpu_ids) MOVSXD is 1 byte longed than the same MOV. Other cases exist as well. Basically compiler is told that nr_cpu_ids can't be negative which can't be deduced if it is "int". Code savings on allyesconfig kernel: -3KB add/remove: 0/0 grow/shrink: 25/264 up/down: 261/-3631 (-3370) function old new delta coretemp_cpu_online 450 512 +62 rcu_init_one 1234 1272 +38 pci_device_probe 374 399 +25 ... pgdat_reclaimable_pages 628 556 -72 select_fallback_rq 446 369 -77 task_numa_find_cpu 1923 1807 -116 Link: http://lkml.kernel.org/r/20170819114959.GA30580@avx2 Signed-off-by: Alexey Dobriyan <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2017-09-08	vga: optimise console scrolling	Matthew Wilcox	4	-0/+53
	Where possible, call memset16(), memmove() or memcpy() instead of using open-coded loops. I don't like the calling convention that uses a byte count instead of a count of u16s, but it's a little late to change that. Reduces code size of fbcon.o by almost 400 bytes on my laptop build. [[email protected]: fix build] Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Matthew Wilcox <[email protected]> Cc: Ralf Baechle <[email protected]> Cc: David Miller <[email protected]> Cc: Sam Ravnborg <[email protected]> Cc: "H. Peter Anvin" <[email protected]> Cc: "James E.J. Bottomley" <[email protected]> Cc: "Martin K. Petersen" <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Ivan Kokshaysky <[email protected]> Cc: Matt Turner <[email protected]> Cc: Michael Ellerman <[email protected]> Cc: Minchan Kim <[email protected]> Cc: Richard Henderson <[email protected]> Cc: Russell King <[email protected]> Cc: Sergey Senozhatsky <[email protected]> Cc: Thomas Gleixner <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2017-09-08	drivers/scsi/sym53c8xx_2/sym_hipd.c: convert to use memset32	Matthew Wilcox	1	-8/+3
	memset32() can be used to initialise these three arrays. Minor code footprint reduction. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Matthew Wilcox <[email protected]> Cc: "James E.J. Bottomley" <[email protected]> Cc: "Martin K. Petersen" <[email protected]> Cc: "H. Peter Anvin" <[email protected]> Cc: David Miller <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Ivan Kokshaysky <[email protected]> Cc: Matt Turner <[email protected]> Cc: Michael Ellerman <[email protected]> Cc: Minchan Kim <[email protected]> Cc: Ralf Baechle <[email protected]> Cc: Richard Henderson <[email protected]> Cc: Russell King <[email protected]> Cc: Sam Ravnborg <[email protected]> Cc: Sergey Senozhatsky <[email protected]> Cc: Thomas Gleixner <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2017-09-08	drivers/block/zram/zram_drv.c: convert to using memset_l	Matthew Wilcox	1	-11/+2
	zram was the motivation for creating memset_l(). Minchan Kim sees a 7% performance improvement on x86 with 100MB of non-zero deduplicatable data: perf stat -r 10 dd if=/dev/zram0 of=/dev/null vanilla: 0.232050465 seconds time elapsed ( +- 0.51% ) memset_l: 0.217219387 seconds time elapsed ( +- 0.07% ) Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Matthew Wilcox <[email protected]> Tested-by: Minchan Kim <[email protected]> Cc: Sergey Senozhatsky <[email protected]> Cc: "H. Peter Anvin" <[email protected]> Cc: "James E.J. Bottomley" <[email protected]> Cc: "Martin K. Petersen" <[email protected]> Cc: David Miller <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Ivan Kokshaysky <[email protected]> Cc: Matt Turner <[email protected]> Cc: Michael Ellerman <[email protected]> Cc: Ralf Baechle <[email protected]> Cc: Richard Henderson <[email protected]> Cc: Russell King <[email protected]> Cc: Sam Ravnborg <[email protected]> Cc: Thomas Gleixner <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2017-09-08	alpha: add support for memset16	Matthew Wilcox	3	-13/+14
	Alpha already had an optimised fill-memory-with-16-bit-quantity assembler routine called memsetw(). It has a slightly different calling convention from memset16() in that it takes a byte count, not a count of words. That's the same convention used by ARM's __memset routines, so rename Alpha's routine to match and add a memset16() wrapper around it. Then convert Alpha's scr_memsetw() to call memset16() instead of memsetw(). Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Matthew Wilcox <[email protected]> Cc: Richard Henderson <[email protected]> Cc: Ivan Kokshaysky <[email protected]> Cc: Matt Turner <[email protected]> Cc: "H. Peter Anvin" <[email protected]> Cc: "James E.J. Bottomley" <[email protected]> Cc: "Martin K. Petersen" <[email protected]> Cc: David Miller <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Michael Ellerman <[email protected]> Cc: Minchan Kim <[email protected]> Cc: Ralf Baechle <[email protected]> Cc: Russell King <[email protected]> Cc: Sam Ravnborg <[email protected]> Cc: Sergey Senozhatsky <[email protected]> Cc: Thomas Gleixner <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2017-09-08	ARM: implement memset32 & memset64	Matthew Wilcox	3	-6/+34
	Reuse the existing optimised memset implementation to implement an optimised memset32 and memset64. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Matthew Wilcox <[email protected]> Reviewed-by: Russell King <[email protected]> Cc: "H. Peter Anvin" <[email protected]> Cc: "James E.J. Bottomley" <[email protected]> Cc: "Martin K. Petersen" <[email protected]> Cc: David Miller <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Ivan Kokshaysky <[email protected]> Cc: Matt Turner <[email protected]> Cc: Michael Ellerman <[email protected]> Cc: Minchan Kim <[email protected]> Cc: Ralf Baechle <[email protected]> Cc: Richard Henderson <[email protected]> Cc: Sam Ravnborg <[email protected]> Cc: Sergey Senozhatsky <[email protected]> Cc: Thomas Gleixner <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2017-09-08	x86: implement memset16, memset32 & memset64	Matthew Wilcox	2	-0/+60
	These are single instructions on x86. There's no 64-bit instruction for x86-32, but we don't yet have any user for memset64() on 32-bit architectures, so don't bother to implement it. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Matthew Wilcox <[email protected]> Cc: Minchan Kim <[email protected]> Cc: Michael Ellerman <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: "H. Peter Anvin" <[email protected]> Cc: "James E.J. Bottomley" <[email protected]> Cc: "Martin K. Petersen" <[email protected]> Cc: David Miller <[email protected]> Cc: Ivan Kokshaysky <[email protected]> Cc: Matt Turner <[email protected]> Cc: Ralf Baechle <[email protected]> Cc: Richard Henderson <[email protected]> Cc: Russell King <[email protected]> Cc: Sam Ravnborg <[email protected]> Cc: Sergey Senozhatsky <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2017-09-08	lib/string.c: add testcases for memset16/32/64	Matthew Wilcox	2	-0/+132
	[[email protected]: minor tweaks] Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Matthew Wilcox <[email protected]> Cc: "H. Peter Anvin" <[email protected]> Cc: "James E.J. Bottomley" <[email protected]> Cc: "Martin K. Petersen" <[email protected]> Cc: David Miller <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Ivan Kokshaysky <[email protected]> Cc: Matt Turner <[email protected]> Cc: Michael Ellerman <[email protected]> Cc: Minchan Kim <[email protected]> Cc: Ralf Baechle <[email protected]> Cc: Richard Henderson <[email protected]> Cc: Russell King <[email protected]> Cc: Sam Ravnborg <[email protected]> Cc: Sergey Senozhatsky <[email protected]> Cc: Thomas Gleixner <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2017-09-08	lib/string.c: add multibyte memset functions	Matthew Wilcox	2	-0/+96
	Patch series "Multibyte memset variations", v4. A relatively common idiom we're missing is a function to fill an area of memory with a pattern which is larger than a single byte. I first noticed this with a zram patch which wanted to fill a page with an 'unsigned long' value. There turn out to be quite a few places in the kernel which can benefit from using an optimised function rather than a loop; sometimes text size, sometimes speed, and sometimes both. The optimised PowerPC version (not included here) improves performance by about 30% on POWER8 on just the raw memset_l(). Most of the extra lines of code come from the three testcases I added. This patch (of 8): memset16(), memset32() and memset64() are like memset(), but allow the caller to fill the destination with a value larger than a single byte. memset_l() and memset_p() allow the caller to use unsigned long and pointer values respectively. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Matthew Wilcox <[email protected]> Cc: "H. Peter Anvin" <[email protected]> Cc: "James E.J. Bottomley" <[email protected]> Cc: "Martin K. Petersen" <[email protected]> Cc: David Miller <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Ivan Kokshaysky <[email protected]> Cc: Matt Turner <[email protected]> Cc: Michael Ellerman <[email protected]> Cc: Minchan Kim <[email protected]> Cc: Ralf Baechle <[email protected]> Cc: Richard Henderson <[email protected]> Cc: Russell King <[email protected]> Cc: Sam Ravnborg <[email protected]> Cc: Sergey Senozhatsky <[email protected]> Cc: Thomas Gleixner <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2017-09-08	linux/kernel.h: move DIV_ROUND_DOWN_ULL() macro	Masahiro Yamada	3	-8/+5
	This macro is useful to avoid link error on 32-bit systems. We have the same definition in two drivers, so move it to include/linux/kernel.h While we are here, refactor DIV_ROUND_UP_ULL() by using DIV_ROUND_DOWN_ULL(). Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Masahiro Yamada <[email protected]> Acked-by: Mark Brown <[email protected]> Cc: Cyrille Pitchen <[email protected]> Cc: Jaroslav Kysela <[email protected]> Cc: Takashi Iwai <[email protected]> Cc: Liam Girdwood <[email protected]> Cc: Boris Brezillon <[email protected]> Cc: Marek Vasut <[email protected]> Cc: Brian Norris <[email protected]> Cc: Richard Weinberger <[email protected]> Cc: David Woodhouse <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2017-09-08	fs, proc: unconditional cond_resched when reading smaps	David Rientjes	1	-2/+3
	If there are large numbers of hugepages to iterate while reading /proc/pid/smaps, the page walk never does cond_resched(). On archs without split pmd locks, there can be significant and observable contention on mm->page_table_lock which cause lengthy delays without rescheduling. Always reschedule in smaps_pte_range() if necessary since the pagewalk iteration can be expensive. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: David Rientjes <[email protected]> Cc: Minchan Kim <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: "Kirill A. Shutemov" <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2017-09-08	proc: uninline proc_create()	Alexey Dobriyan	2	-7/+9
	Save some code from ~320 invocations all clearing last argument. add/remove: 3/0 grow/shrink: 0/158 up/down: 45/-702 (-657) function old new delta proc_create - 17 +17 __ksymtab_proc_create - 16 +16 __kstrtab_proc_create - 12 +12 yam_init_driver 301 298 -3 ... cifs_proc_init 249 228 -21 via_fb_pci_probe 2304 2280 -24 Link: http://lkml.kernel.org/r/20170819094702.GA27864@avx2 Signed-off-by: Alexey Dobriyan <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2017-09-08	fs, proc: remove priv argument from is_stack	Michal Hocko	2	-8/+5
	Commit b18cb64ead40 ("fs/proc: Stop trying to report thread stacks") removed the priv parameter user in is_stack so the argument is redundant. Drop it. [[email protected]: remove unused variable] Link: http://lkml.kernel.org/r/[email protected] Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Michal Hocko <[email protected]> Signed-off-by: Arnd Bergmann <[email protected]> Cc: Andy Lutomirski <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2017-09-08	mm/mempolicy.c: remove BUG_ON() checks for VMA inside mpol_misplaced()	Anshuman Khandual	1	-5/+0
	VMA and its address bounds checks are too late in this function. They must have been verified earlier in the page fault sequence. Hence just remove them. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Anshuman Khandual <[email protected]> Suggested-by: Vlastimil Babka <[email protected]> Acked-by: Vlastimil Babka <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2017-09-08	mm/swapfile.c: fix swapon frontswap_map memory leak on error	David Rientjes	1	-0/+1
	Free frontswap_map if an error is encountered before enable_swap_info(). Signed-off-by: David Rientjes <[email protected]> Reviewed-by: "Huang, Ying" <[email protected]> Cc: Darrick J. Wong <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: <[email protected]> [4.12+] Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2017-09-08	mm: kvfree the swap cluster info if the swap file is unsatisfactory	Darrick J. Wong	1	-1/+1
	If initializing a small swap file fails because the swap file has a problem (holes, etc.) then we need to free the cluster info as part of cleanup. Unfortunately a previous patch changed the code to use kvzalloc but did not change all the vfree calls to use kvfree. Found by running generic/357 from xfstests. Link: http://lkml.kernel.org/r/20170831233515.GR3775@magnolia Fixes: 54f180d3c181 ("mm, swap: use kvzalloc to allocate some swap data structures") Signed-off-by: Darrick J. Wong <[email protected]> Reviewed-by: "Huang, Ying" <[email protected]> Acked-by: David Rientjes <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: <[email protected]> [4.12+] Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2017-09-08	mm/page_alloc.c: apply gfp_allowed_mask before the first allocation attempt	Tetsuo Handa	1	-1/+2
	We are by error initializing alloc_flags before gfp_allowed_mask is applied. This could cause problems after pm_restrict_gfp_mask() is called during suspend operation. Apply gfp_allowed_mask before initializing alloc_flags so that the first allocation attempt uses correct flags. Link: http://lkml.kernel.org/r/[email protected] Fixes: 83d4ca8148fd9092 ("mm, page_alloc: move __GFP_HARDWALL modifications out of the fastpath") Signed-off-by: Tetsuo Handa <[email protected]> Acked-by: Michal Hocko <[email protected]> Acked-by: Vlastimil Babka <[email protected]> Cc: Mel Gorman <[email protected]> Cc: Jesper Dangaard Brouer <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2017-09-08	tools/testing/selftests/kcmp/kcmp_test.c: add KCMP_EPOLL_TFD testing	Cyrill Gorcunov	1	-2/+58
	KCMP's KCMP_EPOLL_TFD mode merged in commit 0791e3644e5ef2 ("kcmp: add KCMP_EPOLL_TFD mode to compare epoll target files") we've had no selftest for it yet (except in criu development list). Thus add it. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Cyrill Gorcunov <[email protected]> Cc: Andrey Vagin <[email protected]> Cc: Pavel Emelyanov <[email protected]> Cc: Michael Kerrisk <[email protected]> Cc: Shuah Khan <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2017-09-08	mm/sparse.c: fix typo in online_mem_sections	Michal Hocko	1	-1/+1
	online_mem_sections() accidentally marks online only the first section in the given range. This is a typo which hasn't been noticed because I haven't tested large 2GB blocks previously. All users of pfn_to_online_page would get confused on the the rest of the pfn range in the block. All we need to fix this is to use iterator (pfn) rather than start_pfn. Link: http://lkml.kernel.org/r/[email protected] Fixes: 2d070eab2e82 ("mm: consider zone which is not fully populated to have holes") Signed-off-by: Michal Hocko <[email protected]> Acked-by: Vlastimil Babka <[email protected]> Cc: Anshuman Khandual <[email protected]> Cc: <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2017-09-08	mm/memory.c: fix mem_cgroup_oom_disable() call missing	Laurent Dufour	1	-5/+5
	Seen while reading the code, in handle_mm_fault(), in the case arch_vma_access_permitted() is failing the call to mem_cgroup_oom_disable() is not made. To fix that, move the call to mem_cgroup_oom_enable() after calling arch_vma_access_permitted() as it should not have entered the memcg OOM. Link: http://lkml.kernel.org/r/[email protected] Fixes: bae473a423f6 ("mm: introduce fault_env") Signed-off-by: Laurent Dufour <[email protected]> Acked-by: Kirill A. Shutemov <[email protected]> Acked-by: Michal Hocko <[email protected]> Cc: <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2017-09-08	mm: memcontrol: use per-cpu stocks for socket memory uncharging	Roman Gushchin	1	-2/+4
	We've noticed a quite noticeable performance overhead on some hosts with significant network traffic when socket memory accounting is enabled. Perf top shows that socket memory uncharging path is hot: 2.13% [kernel] [k] page_counter_cancel 1.14% [kernel] [k] __sk_mem_reduce_allocated 1.14% [kernel] [k] _raw_spin_lock 0.87% [kernel] [k] _raw_spin_lock_irqsave 0.84% [kernel] [k] tcp_ack 0.84% [kernel] [k] ixgbe_poll 0.83% < workload > 0.82% [kernel] [k] enqueue_entity 0.68% [kernel] [k] __fget 0.68% [kernel] [k] tcp_delack_timer_handler 0.67% [kernel] [k] __schedule 0.60% < workload > 0.59% [kernel] [k] __inet6_lookup_established 0.55% [kernel] [k] __switch_to 0.55% [kernel] [k] menu_select 0.54% libc-2.20.so [.] __memcpy_avx_unaligned To address this issue, the existing per-cpu stock infrastructure can be used. refill_stock() can be called from mem_cgroup_uncharge_skmem() to move charge to a per-cpu stock instead of calling atomic page_counter_uncharge(). To prevent the uncontrolled growth of per-cpu stocks, refill_stock() will explicitly drain the cached charge, if the cached value exceeds CHARGE_BATCH. This allows significantly optimize the load: 1.21% [kernel] [k] _raw_spin_lock 1.01% [kernel] [k] ixgbe_poll 0.92% [kernel] [k] _raw_spin_lock_irqsave 0.90% [kernel] [k] enqueue_entity 0.86% [kernel] [k] tcp_ack 0.85% < workload > 0.74% perf-11120.map [.] 0x000000000061bf24 0.73% [kernel] [k] __schedule 0.67% [kernel] [k] __fget 0.63% [kernel] [k] __inet6_lookup_established 0.62% [kernel] [k] menu_select 0.59% < workload > 0.59% [kernel] [k] __switch_to 0.57% libc-2.20.so [.] __memcpy_avx_unaligned Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Roman Gushchin <[email protected]> Acked-by: Johannes Weiner <[email protected]> Acked-by: Michal Hocko <[email protected]> Cc: Vladimir Davydov <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2017-09-08	mm: fadvise: avoid fadvise for fs without backing device	Shakeel Butt	1	-3/+3
	The fadvise() manpage is silent on fadvise()'s effect on memory-based filesystems (shmem, hugetlbfs & ramfs) and pseudo file systems (procfs, sysfs, kernfs). The current implementaion of fadvise is mostly a noop for such filesystems except for FADV_DONTNEED which will trigger expensive remote LRU cache draining. This patch makes the noop of fadvise() on such file systems very explicit. However this change has two side effects for ramfs and one for tmpfs. First fadvise(FADV_DONTNEED) could remove the unmapped clean zero'ed pages of ramfs (allocated through read, readahead & read fault) and tmpfs (allocated through read fault). Also fadvise(FADV_WILLNEED) could create such clean zero'ed pages for ramfs. This change removes those possibilities. One of our generic libraries does fadvise(FADV_DONTNEED). Recently we observed high latency in fadvise() and noticed that the users have started using tmpfs files and the latency was due to expensive remote LRU cache draining. For normal tmpfs files (have data written on them), fadvise(FADV_DONTNEED) will always trigger the unneeded remote cache draining. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Shakeel Butt <[email protected]> Cc: Mel Gorman <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Hillf Danton <[email protected]> Cc: Vlastimil Babka <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Greg Thelen <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2017-09-08	mm/zsmalloc.c: change stat type parameter to int	Matthias Kaehlcke	1	-3/+6
	zs_stat_inc/dec/get() uses enum zs_stat_type for the stat type, however some callers pass an enum fullness_group value. Change the type to int to reflect the actual use of the functions and get rid of 'enum-conversion' warnings Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Matthias Kaehlcke <[email protected]> Reviewed-by: Sergey Senozhatsky <[email protected]> Acked-by: Minchan Kim <[email protected]> Cc: Doug Anderson <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2017-09-08	mm/mlock.c: use page_zone() instead of page_zone_id()	Joonsoo Kim	1	-6/+4
	page_zone_id() is a specialized function to compare the zone for the pages that are within the section range. If the section of the pages are different, page_zone_id() can be different even if their zone is the same. This wrong usage doesn't cause any actual problem since __munlock_pagevec_fill() would be called again with failed index. However, it's better to use more appropriate function here. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Joonsoo Kim <[email protected]> Acked-by: Vlastimil Babka <[email protected]> Cc: Minchan Kim <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2017-09-08	mm: consider the number in local CPUs when reading NUMA stats	Kemi Wang	2	-3/+12
	To avoid deviation, the per cpu number of NUMA stats in vm_numa_stat_diff[] is included when a user reads the NUMA stats. Since NUMA stats does not be read by users frequently, and kernel does not need it to make a decision, it will not be a problem to make the readers more expensive. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Kemi Wang <[email protected]> Reported-by: Jesper Dangaard Brouer <[email protected]> Acked-by: Mel Gorman <[email protected]> Cc: Aaron Lu <[email protected]> Cc: Andi Kleen <[email protected]> Cc: Christopher Lameter <[email protected]> Cc: Dave Hansen <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Tim Chen <[email protected]> Cc: Ying Huang <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2017-09-08	mm: update NUMA counter threshold size	Kemi Wang	2	-20/+11
	There is significant overhead in cache bouncing caused by zone counters (NUMA associated counters) update in parallel in multi-threaded page allocation (suggested by Dave Hansen). This patch updates NUMA counter threshold to a fixed size of MAX_U16 - 2, as a small threshold greatly increases the update frequency of the global counter from local per cpu counter(suggested by Ying Huang). The rationality is that these statistics counters don't affect the kernel's decision, unlike other VM counters, so it's not a problem to use a large threshold. With this patchset, we see 31.3% drop of CPU cycles(537-->369) for per single page allocation and reclaim on Jesper's page_bench03 benchmark. Benchmark provided by Jesper D Brouer(increase loop times to 10000000): https://github.com/netoptimizer/prototype-kernel/tree/master/kernel/mm/ bench Threshold CPU cycles Throughput(88 threads) 32 799 241760478 64 640 301628829 125 537 358906028 <==> system by default (base) 256 468 412397590 512 428 450550704 4096 399 482520943 20000 394 489009617 30000 395 488017817 65533 369(-31.3%) 521661345(+45.3%) <==> with this patchset N/A 342(-36.3%) 562900157(+56.8%) <==> disable zone_statistics Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Kemi Wang <[email protected]> Reported-by: Jesper Dangaard Brouer <[email protected]> Suggested-by: Dave Hansen <[email protected]> Suggested-by: Ying Huang <[email protected]> Acked-by: Mel Gorman <[email protected]> Cc: Aaron Lu <[email protected]> Cc: Andi Kleen <[email protected]> Cc: Christopher Lameter <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Tim Chen <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2017-09-08	mm: change the call sites of numa statistics items	Kemi Wang	5	-27/+220
	Patch series "Separate NUMA statistics from zone statistics", v2. Each page allocation updates a set of per-zone statistics with a call to zone_statistics(). As discussed in 2017 MM summit, these are a substantial source of overhead in the page allocator and are very rarely consumed. This significant overhead in cache bouncing caused by zone counters (NUMA associated counters) update in parallel in multi-threaded page allocation (pointed out by Dave Hansen). A link to the MM summit slides: http://people.netfilter.org/hawk/presentations/MM-summit2017/MM-summit2017-JesperBrouer.pdf To mitigate this overhead, this patchset separates NUMA statistics from zone statistics framework, and update NUMA counter threshold to a fixed size of MAX_U16 - 2, as a small threshold greatly increases the update frequency of the global counter from local per cpu counter (suggested by Ying Huang). The rationality is that these statistics counters don't need to be read often, unlike other VM counters, so it's not a problem to use a large threshold and make readers more expensive. With this patchset, we see 31.3% drop of CPU cycles(537-->369, see below) for per single page allocation and reclaim on Jesper's page_bench03 benchmark. Meanwhile, this patchset keeps the same style of virtual memory statistics with little end-user-visible effects (only move the numa stats to show behind zone page stats, see the first patch for details). I did an experiment of single page allocation and reclaim concurrently using Jesper's page_bench03 benchmark on a 2-Socket Broadwell-based server (88 processors with 126G memory) with different size of threshold of pcp counter. Benchmark provided by Jesper D Brouer(increase loop times to 10000000): https://github.com/netoptimizer/prototype-kernel/tree/master/kernel/mm/bench Threshold CPU cycles Throughput(88 threads) 32 799 241760478 64 640 301628829 125 537 358906028 <==> system by default 256 468 412397590 512 428 450550704 4096 399 482520943 20000 394 489009617 30000 395 488017817 65533 369(-31.3%) 521661345(+45.3%) <==> with this patchset N/A 342(-36.3%) 562900157(+56.8%) <==> disable zone_statistics This patch (of 3): In this patch, NUMA statistics is separated from zone statistics framework, all the call sites of NUMA stats are changed to use numa-stats-specific functions, it does not have any functionality change except that the number of NUMA stats is shown behind zone page stats when users read the zone info. E.g. cat /proc/zoneinfo *Base* *With this patch* nr_free_pages 3976 nr_free_pages 3976 nr_zone_inactive_anon 0 nr_zone_inactive_anon 0 nr_zone_active_anon 0 nr_zone_active_anon 0 nr_zone_inactive_file 0 nr_zone_inactive_file 0 nr_zone_active_file 0 nr_zone_active_file 0 nr_zone_unevictable 0 nr_zone_unevictable 0 nr_zone_write_pending 0 nr_zone_write_pending 0 nr_mlock 0 nr_mlock 0 nr_page_table_pages 0 nr_page_table_pages 0 nr_kernel_stack 0 nr_kernel_stack 0 nr_bounce 0 nr_bounce 0 nr_zspages 0 nr_zspages 0 numa_hit 0 nr_free_cma 0 numa_miss 0 numa_hit 0 numa_foreign 0 numa_miss 0 numa_interleave 0 numa_foreign 0 numa_local 0 numa_interleave 0 numa_other 0 numa_local 0 nr_free_cma 0 numa_other 0 ... ... vm stats threshold: 10 vm stats threshold: 10 ... ... The next patch updates the numa stats counter size and threshold. [[email protected]: coding-style fixes] Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Kemi Wang <[email protected]> Reported-by: Jesper Dangaard Brouer <[email protected]> Acked-by: Mel Gorman <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Christopher Lameter <[email protected]> Cc: Dave Hansen <[email protected]> Cc: Andi Kleen <[email protected]> Cc: Ying Huang <[email protected]> Cc: Aaron Lu <[email protected]> Cc: Tim Chen <[email protected]> Cc: Dave Hansen <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>