blaster4385/linux-IllusionX - Linux kernel with personal config changes for arch linux

Age	Commit message (Collapse)	Author	Files	Lines
2016-12-12	printk/NMI: handle continuous lines and missing newline	Petr Mladek	1	-28/+50
	Commit 4bcc595ccd80 ("printk: reinstate KERN_CONT for printing continuation lines") added back KERN_CONT message header. As a result it might appear in the middle of the line when the parts are squashed via the temporary NMI buffer. A reasonable solution seems to be to split the text in the NNI temporary not only by newlines but also by the message headers. Another solution would be to filter out KERN_CONT when writing to the temporary buffer. But this would complicate the lockless handling. Also it would not solve problems with a missing newline that was there even before the KERN_CONT stuff. This patch moves the temporary buffer handling into separate function. I played with it and it seems that using the char pointers make the code easier to read. Also it prints the final newline as a continuous line. Finally, it moves handling of the s->len overflow into the paranoid check. And allows to recover from the disaster. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Petr Mladek <[email protected]> Reviewed-by: Sergey Senozhatsky <[email protected]> Cc: Joe Perches <[email protected]> Cc: Steven Rostedt <[email protected]> Cc: Jason Wessel <[email protected]> Cc: Jaroslav Kysela <[email protected]> Cc: Takashi Iwai <[email protected]> Cc: Chris Mason <[email protected]> Cc: Josef Bacik <[email protected]> Cc: David Sterba <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2016-12-12	printk/NMI: fix up handling of the full nmi log buffer	Petr Mladek	1	-2/+3
	vsnprintf() adds the trailing '\0' but it does not count it into the number of printed characters. The result is that there is one byte less space for the real characters in the buffer. The broken check for the free space might cause that we will repeatedly try to print 1 character into the buffer, never reach the full buffer, and do not count the messages as missed. Also vsnprintf() returns the number of characters that would be printed if the buffer was big enough. As a result, s->len might be bigger than the size of the buffer[]. And the printk() function might return bigger len than it really printed. Both problems are fixed by using vscnprintf() instead. Note that I though about increasing the number of missed messages even when the message was shrunken. But it made the code even more complicated. I think that it is not worth it. Shrunken messages are usually easy to recognize. And it should be a corner case. [] The overflown s->len value is crazy and unexpected. I "made a mistake" and reported this situation as an internal error when fixed handling of PR_CONT headers in some other patch. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Petr Mladek <[email protected]> CcL Sergey Senozhatsky <[email protected]> Cc: Chris Mason <[email protected]> Cc: David Sterba <[email protected]> Cc: Jason Wessel <[email protected]> Cc: Josef Bacik <[email protected]> Cc: Joe Perches <[email protected]> Cc: Jaroslav Kysela <[email protected]> Cc: Steven Rostedt <[email protected]> Cc: Takashi Iwai <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2016-12-12	compiler-gcc.h: use "proved" instead of "proofed"	Benjamin Peterson	1	-1/+1
	Link: http://lkml.kernel.org/r/1477894241.1103202.772260161.1B0A5995@webmail.messagingengine.com Signed-off-by: Benjamin Peterson <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2016-12-12	hung_task: decrement sysctl_hung_task_warnings only if it is positive	Tetsuo Handa	1	-1/+2
	Since sysctl_hung_task_warnings == -1 is allowed (infinite warnings), commit 48a6d64edadb ("hung_task: allow hung_task_panic when hung_task_warnings is 0") should decrement it only when it is not -1. This prevents the kernel from ceasing warnings after the first 4294967295 ;) Signed-off-by: Tetsuo Handa <[email protected]> Cc: John Siddle <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2016-12-12	fs/proc: calculate /proc/* and /proc//task/ nlink at init time	Alexey Dobriyan	3	-6/+15
	Runtime nlink calculation works but meh. I don't know how to do it at compile time, but I know how to do it at init time. Shift "2+" part into init time as a bonus. Link: http://lkml.kernel.org/r/20161122195549.GB29812@avx2 Signed-off-by: Alexey Dobriyan <[email protected]> Reviewed-by: Vegard Nossum <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2016-12-12	fs/proc/base.c: save decrement during lookup/readdir in /proc/$PID	Alexey Dobriyan	1	-4/+4
	Comparison for "<" works equally well as comparison for "<=" but one SUB/LEA is saved (no, it is not optimised away, at least here). Link: http://lkml.kernel.org/r/20161122195143.GA29812@avx2 Signed-off-by: Alexey Dobriyan <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2016-12-12	fs/proc/array.c: slightly improve render_sigset_t	Rasmus Villemoes	1	-1/+1
	format_decode and vsnprintf occasionally show up in perf top, so I went looking for places that might not need the full printf power. With the help of kprobes, I gathered some statistics on which format strings we mostly pass to vsnprintf. On a trivial desktop workload, I hit "%x" 25% of the time, so something apparently reads /proc/pid/status (which does 5*16 printf("%x") calls) a lot. With this patch, reading /proc/pid/status is 30% faster according to this microbenchmark: char buf[4096]; int i, fd; for (i = 0; i < 10000; ++i) { fd = open("/proc/self/status", O_RDONLY); read(fd, buf, sizeof(buf)); close(fd); } Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Rasmus Villemoes <[email protected]> Acked-by: Andrei Vagin <[email protected]> Acked-by: Kees Cook <[email protected]> Cc: Alexey Dobriyan <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2016-12-12	proc: tweak comments about 2 stage open and everything	Alexey Dobriyan	1	-8/+21
	Some comments were obsoleted since commit 05c0ae21c034 ("try a saner locking for pde_opener..."). Some new comments added. Some confusing comments replaced with equally confusing ones. Link: http://lkml.kernel.org/r/20161029160231.GD1246@avx2 Signed-off-by: Alexey Dobriyan <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2016-12-12	proc: kmalloc struct pde_opener	Alexey Dobriyan	1	-1/+3
	kzalloc is too much, half of the fields will be reinitialized anyway. If proc file doesn't have ->release hook (some still do not), clearing is unnecessary because it will be freed immediately. Link: http://lkml.kernel.org/r/20161029155747.GC1246@avx2 Signed-off-by: Alexey Dobriyan <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2016-12-12	proc: fix type of struct pde_opener::closing field	Alexey Dobriyan	2	-2/+2
	struct pde_opener::closing is boolean. Link: http://lkml.kernel.org/r/20161029155439.GB1246@avx2 Signed-off-by: Alexey Dobriyan <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2016-12-12	proc: just list_del() struct pde_opener	Alexey Dobriyan	1	-1/+1
	list_del_init() is too much, structure will be freed in three lines anyway. Link: http://lkml.kernel.org/r/20161029155313.GA1246@avx2 Signed-off-by: Alexey Dobriyan <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2016-12-12	proc: make struct struct map_files_info::len unsigned int	Alexey Dobriyan	1	-1/+1
	Linux doesn't support 4GB+ filenames in /proc, so unsigned long is too much. MOV r64, r/m64 is larger than MOV r32, r/m32. Link: http://lkml.kernel.org/r/20161029161123.GG1246@avx2 Signed-off-by: Alexey Dobriyan <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2016-12-12	proc: make struct pid_entry::len unsigned	Alexey Dobriyan	1	-1/+1
	"unsigned int" is better on x86_64 because it most of the time it autoexpands to 64-bit value while "int" requires MOVSX instruction. Link: http://lkml.kernel.org/r/20161029160810.GF1246@avx2 Signed-off-by: Alexey Dobriyan <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2016-12-12	proc: report no_new_privs state	Kees Cook	2	-2/+5
	Similar to being able to examine if a process has been correctly confined with seccomp, the state of no_new_privs is equally interesting, so this adds it to /proc/$pid/status. Link: http://lkml.kernel.org/r/20161103214041.GA58566@beast Signed-off-by: Kees Cook <[email protected]> Reviewed-by: Jann Horn <[email protected]> Cc: Jonathan Corbet <[email protected]> Cc: Vlastimil Babka <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Konstantin Khlebnikov <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Naoya Horiguchi <[email protected]> Cc: Rodrigo Freire <[email protected]> Cc: John Stultz <[email protected]> Cc: Ross Zwisler <[email protected]> Cc: Robert Ho <[email protected]> Cc: Jerome Marchand <[email protected]> Cc: Andy Lutomirski <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Alexey Dobriyan <[email protected]> Cc: "Richard W.M. Jones" <[email protected]> Cc: Joe Perches <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2016-12-12	mm/percpu.c: fix panic triggered by BUG_ON() falsely	zijun_hu	1	-4/+12
	As shown by pcpu_build_alloc_info(), the number of units within a percpu group is deduced by rounding up the number of CPUs within the group to @upa boundary/ Therefore, the number of CPUs isn't equal to the units's if it isn't aligned to @upa normally. However, pcpu_page_first_chunk() uses BUG_ON() to assert that one number is equal to the other roughly, so a panic is maybe triggered by the BUG_ON() incorrectly. In order to fix this issue, the number of CPUs is rounded up then compared with units's and the BUG_ON() is replaced with a warning and return of an error code as well, to keep system alive as much as possible. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: zijun_hu <[email protected]> Cc: Tejun Heo <[email protected]> Cc: Christoph Lameter <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2016-12-12	kasan: turn on -fsanitize-address-use-after-scope	Andrey Ryabinin	1	-0/+2
	In the upcoming gcc7 release, the -fsanitize=kernel-address option at first implied new -fsanitize-address-use-after-scope option. This would cause link errors on older kernels because they don't have two new functions required for use-after-scope support. Therefore, gcc7 changed default to -fno-sanitize-address-use-after-scope. Now the kernel has everything required for that feature since commit 828347f8f9a5 ("kasan: support use-after-scope detection"). So, to make it work, we just have to enable use-after-scope in CFLAGS. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Andrey Ryabinin <[email protected]> Acked-by: Dmitry Vyukov <[email protected]> Cc: Alexander Potapenko <[email protected]> Cc: Andrey Konovalov <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2016-12-12	kasan: eliminate long stalls during quarantine reduction	Dmitry Vyukov	1	-46/+48
	Currently we dedicate 1/32 of RAM for quarantine and then reduce it by 1/4 of total quarantine size. This can be a significant amount of memory. For example, with 4GB of RAM total quarantine size is 128MB and it is reduced by 32MB at a time. With 128GB of RAM total quarantine size is 4GB and it is reduced by 1GB. This leads to several problems: - freeing 1GB can take tens of seconds, causes rcu stall warnings and just introduces unexpected long delays at random places - if kmalloc() is called under a mutex, other threads stall on that mutex while a thread reduces quarantine - threads wait on quarantine_lock while one thread grabs a large batch of objects to evict - we walk the uncached list of object to free twice which makes all of the above worse - when a thread frees objects, they are already not accounted against global_quarantine.bytes; as the result we can have quarantine_size bytes in quarantine + unbounded amount of memory in large batches in threads that are in process of freeing Reduce size of quarantine in smaller batches to reduce the delays. The only reason to reduce it in batches is amortization of overheads, the new batch size of 1MB should be well enough to amortize spinlock lock/unlock and few function calls. Plus organize quarantine as a FIFO array of batches. This allows to not walk the list in quarantine_reduce() under quarantine_lock, which in turn reduces contention and is just faster. This improves performance of heavy load (syzkaller fuzzing) by ~20% with 4 CPUs and 32GB of RAM. Also this eliminates frequent (every 5 sec) drops of CPU consumption from ~400% to ~100% (one thread reduces quarantine while others are waiting on a mutex). Some reference numbers: 1. Machine with 4 CPUs and 4GB of memory. Quarantine size 128MB. Currently we free 32MB at at time. With new code we free 1MB at a time (1024 batches, ~128 are used). 2. Machine with 32 CPUs and 128GB of memory. Quarantine size 4GB. Currently we free 1GB at at time. With new code we free 8MB at a time (1024 batches, ~512 are used). 3. Machine with 4096 CPUs and 1TB of memory. Quarantine size 32GB. Currently we free 8GB at at time. With new code we free 4MB at a time (16K batches, ~8K are used). Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Dmitry Vyukov <[email protected]> Cc: Eric Dumazet <[email protected]> Cc: Greg Thelen <[email protected]> Cc: Alexander Potapenko <[email protected]> Cc: Andrey Ryabinin <[email protected]> Cc: Andrey Konovalov <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2016-12-12	kasan: support panic_on_warn	Dmitry Vyukov	1	-0/+2
	If user sets panic_on_warn, he wants kernel to panic if there is anything barely wrong with the kernel. KASAN-detected errors are definitely not less benign than an arbitrary kernel WARNING. Panic after KASAN errors if panic_on_warn is set. We use this for continuous fuzzing where we want kernel to stop and reboot on any error. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Dmitry Vyukov <[email protected]> Acked-by: Andrey Ryabinin <[email protected]> Cc: Alexander Potapenko <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2016-12-12	mm: make transparent hugepage size public	Hugh Dickins	2	-0/+15
	Test programs want to know the size of a transparent hugepage. While it is commonly the same as the size of a hugetlbfs page (shown as Hugepagesize in /proc/meminfo), that is not always so: powerpc implements transparent hugepages in a different way from hugetlbfs pages, so it's coincidence when their sizes are the same; and x86 and others can support more than one hugetlbfs page size. Add /sys/kernel/mm/transparent_hugepage/hpage_pmd_size to show the THP size in bytes - it's the same for Anonymous and Shmem hugepages. Call it hpage_pmd_size (after HPAGE_PMD_SIZE) rather than hpage_size, in case some transparent support for pud and pgd pages is added later. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Hugh Dickins <[email protected]> Acked-by: Michal Hocko <[email protected]> Acked-by: Vlastimil Babka <[email protected]> Cc: Greg Thelen <[email protected]> Cc: David Rientjes <[email protected]> Cc: "Kirill A. Shutemov" <[email protected]> Cc: Andrea Arcangeli <[email protected]> Cc: "Aneesh Kumar K.V" <[email protected]> Cc: Dave Hansen <[email protected]> Cc: Dan Williams <[email protected]> Cc: Jan Kara <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2016-12-12	mm: add cond_resched() in gather_pte_stats()	Hugh Dickins	1	-0/+1
	The other pagetable walks in task_mmu.c have a cond_resched() after walking their ptes: add a cond_resched() in gather_pte_stats() too, for reading /proc/<id>/numa_maps. Only pagemap_pmd_range() has a cond_resched() in its (unusually expensive) pmd_trans_huge case: more should probably be added, but leave them unchanged for now. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Hugh Dickins <[email protected]> Acked-by: Michal Hocko <[email protected]> Cc: David Rientjes <[email protected]> Cc: Gerald Schaefer <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2016-12-12	mm: add three more cond_resched() in swapoff	Hugh Dickins	1	-7/+6
	Add a cond_resched() in the unuse_pmd_range() loop (so as to call it even when pmd none or trans_huge, like zap_pmd_range() does); and in the unuse_mm() loop (since that might skip over many vmas). shmem_unuse() and radix_tree_locate_item() look good enough already. Those were the obvious places, but in fact the stalls came from find_next_to_unuse(), which sometimes scans through many unused entries. Apply scan_swap_map()'s LATENCY_LIMIT of 256 there too; and only go off to test frontswap_map when a used entry is found. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Hugh Dickins <[email protected]> Reported-by: Eric Dumazet <[email protected]> Cc: David Rientjes <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2016-12-12	mm, page_alloc: keep pcp count and list contents in sync if struct page is ↵	Mel Gorman	1	-2/+10
	corrupted Vlastimil Babka pointed out that commit 479f854a207c ("mm, page_alloc: defer debugging checks of pages allocated from the PCP") will allow the per-cpu list counter to be out of sync with the per-cpu list contents if a struct page is corrupted. The consequence is an infinite loop if the per-cpu lists get fully drained by free_pcppages_bulk because all the lists are empty but the count is positive. The infinite loop occurs here do { batch_free++; if (++migratetype == MIGRATE_PCPTYPES) migratetype = 0; list = &pcp->lists[migratetype]; } while (list_empty(list)); What the user sees is a bad page warning followed by a soft lockup with interrupts disabled in free_pcppages_bulk(). This patch keeps the accounting in sync. Fixes: 479f854a207c ("mm, page_alloc: defer debugging checks of pages allocated from the PCP") Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Mel Gorman <[email protected]> Acked-by: Vlastimil Babka <[email protected]> Acked-by: Michal Hocko <[email protected]> Acked-by: Hillf Danton <[email protected]> Cc: Christoph Lameter <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Jesper Dangaard Brouer <[email protected]> Cc: Joonsoo Kim <[email protected]> Cc: <[email protected]> [4.7+] Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2016-12-12	mm, rmap: handle anon_vma_prepare() common case inline	Vlastimil Babka	2	-36/+43
	anon_vma_prepare() is mostly a large "if (unlikely(...))" block, as the expected common case is that an anon_vma already exists. We could turn the condition around and return 0, but it also makes sense to do it inline and avoid a call for the common case. Bloat-o-meter naturally shows that inlining the check has some code size costs: add/remove: 1/1 grow/shrink: 4/0 up/down: 475/-373 (102) function old new delta __anon_vma_prepare - 359 +359 handle_mm_fault 2744 2796 +52 hugetlb_cow 1146 1170 +24 hugetlb_fault 2123 2145 +22 wp_page_copy 1469 1487 +18 anon_vma_prepare 373 - -373 Checking the asm however confirms that the hot paths now avoid a call, which is moved away. [[email protected]: coding-style fixes] Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Vlastimil Babka <[email protected]> Cc: "Kirill A. Shutemov" <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Konstantin Khlebnikov <[email protected]> Cc: Rik van Riel <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2016-12-12	mm, debug: print raw struct page data in __dump_page()	Vlastimil Babka	1	-0/+4
	__dump_page() is used when a page metadata inconsistency is detected, either by standard runtime checks, or extra checks in CONFIG_DEBUG_VM builds. It prints some of the relevant metadata, but not the whole struct page, which is based on unions and interpretation is dependent on the context. This means that sometimes e.g. a VM_BUG_ON_PAGE() checks certain field, which is however not printed by __dump_page() and the resulting bug report may then lack clues that could help in determining the root cause. This patch solves the problem by simply printing the whole struct page word by word, so no part is missing, but the interpretation of the data is left to developers. This is similar to e.g. x86_64 raw stack dumps. Example output: page:ffffea00000475c0 count:1 mapcount:0 mapping: (null) index:0x0 flags: 0x100000000000400(reserved) raw: 0100000000000400 0000000000000000 0000000000000000 00000001ffffffff raw: ffffea00000475e0 ffffea00000475e0 0000000000000000 0000000000000000 page dumped because: VM_BUG_ON_PAGE(1) [[email protected]: suggested print_hex_dump()] Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Vlastimil Babka <[email protected]> Acked-by: Kirill A. Shutemov <[email protected]> Acked-by: Andrey Ryabinin <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2016-12-12	mm: THP page cache support for ppc64	Aneesh Kumar K.V	6	-17/+100
	Add arch specific callback in the generic THP page cache code that will deposit and withdarw preallocated page table. Archs like ppc64 use this preallocated table to store the hash pte slot information. Testing: kernel build of the patch series on tmpfs mounted with option huge=always The related thp stat: thp_fault_alloc 72939 thp_fault_fallback 60547 thp_collapse_alloc 603 thp_collapse_alloc_failed 0 thp_file_alloc 253763 thp_file_mapped 4251 thp_split_page 51518 thp_split_page_failed 1 thp_deferred_split_page 73566 thp_split_pmd 665 thp_zero_page_alloc 3 thp_zero_page_alloc_failed 0 [[email protected]: remove unneeded parentheses, per Kirill] Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Aneesh Kumar K.V <[email protected]> Acked-by: Kirill A. Shutemov <[email protected]> Cc: Michael Ellerman <[email protected]> Cc: Benjamin Herrenschmidt <[email protected]> Cc: Michael Neuling <[email protected]> Cc: Paul Mackerras <[email protected]> Cc: Balbir Singh <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2016-12-12	mm: move vma_is_anonymous check within pmd_move_must_withdraw	Aneesh Kumar K.V	3	-15/+18
	Independent of whether the vma is for anonymous memory, some arches like ppc64 would like to override pmd_move_must_withdraw(). One option is to encapsulate the vma_is_anonymous() check for general architectures inside pmd_move_must_withdraw() so that is always called and architectures that need unconditional overriding can override this function. ppc64 needs to override the function when the MMU is configured to use hash PTE's. [[email protected]: reworked changelog] Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Aneesh Kumar K.V <[email protected]> Acked-by: Kirill A. Shutemov <[email protected]> Acked-by: Michael Ellerman <[email protected]> (powerpc) Cc: Benjamin Herrenschmidt <[email protected]> Cc: Michael Neuling <[email protected]> Cc: Paul Mackerras <[email protected]> Cc: Balbir Singh <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2016-12-12	mm: add preempt points into __purge_vmap_area_lazy()	Joel Fernandes	1	-5/+9
	Use cond_resched_lock to avoid holding the vmap_area_lock for a potentially long time and thus creating bad latencies for various workloads. [hch: split from a larger patch by Joel, wrote the crappy changelog] Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Joel Fernandes <[email protected]> Signed-off-by: Christoph Hellwig <[email protected]> Tested-by: Jisheng Zhang <[email protected]> Cc: Andrey Ryabinin <[email protected]> Cc: Chris Wilson <[email protected]> Cc: John Dias <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: "H. Peter Anvin" <[email protected]> Cc: Ingo Molnar <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2016-12-12	mm: turn vmap_purge_lock into a mutex	Christoph Hellwig	1	-7/+7
	The purge_lock spinlock causes high latencies with non RT kernel. This has been reported multiple times on lkml [1] [2] and affects applications like audio. This patch replaces it with a mutex to allow preemption while holding the lock. Thanks to Joel Fernandes for the detailed report and analysis as well as an earlier attempt at fixing this issue. [1] http://lists.openwall.net/linux-kernel/2016/03/23/29 [2] https://lkml.org/lkml/2016/10/9/59 Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Christoph Hellwig <[email protected]> Tested-by: Jisheng Zhang <[email protected]> Cc: Andrey Ryabinin <[email protected]> Cc: Joel Fernandes <[email protected]> Cc: Chris Wilson <[email protected]> Cc: John Dias <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: "H. Peter Anvin" <[email protected]> Cc: Ingo Molnar <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2016-12-12	mm: mark all calls into the vmalloc subsystem as potentially sleeping	Christoph Hellwig	1	-1/+6
	We will take a sleeping lock in later in this series, so this adds the proper safeguards. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Christoph Hellwig <[email protected]> Tested-by: Jisheng Zhang <[email protected]> Cc: Andrey Ryabinin <[email protected]> Cc: Joel Fernandes <[email protected]> Cc: Chris Wilson <[email protected]> Cc: John Dias <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: "H. Peter Anvin" <[email protected]> Cc: Ingo Molnar <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2016-12-12	x86/ldt: use vfree_atomic() to free ldt entries	Andrey Ryabinin	1	-1/+1
	vfree() is going to use sleeping lock. free_ldt_struct() may be called with disabled preemption, therefore we must use vfree_atomic() here. E.g. call trace: vfree() free_ldt_struct() destroy_context_ldt() __mmdrop() finish_task_switch() schedule_tail() ret_from_fork() Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Andrey Ryabinin <[email protected]> Signed-off-by: Christoph Hellwig <[email protected]> Cc: Joel Fernandes <[email protected]> Cc: Jisheng Zhang <[email protected]> Cc: Chris Wilson <[email protected]> Cc: John Dias <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: "H. Peter Anvin" <[email protected]> Cc: Ingo Molnar <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2016-12-12	kernel/fork: use vfree_atomic() to free thread stack	Andrey Ryabinin	1	-1/+1
	vfree() is going to use sleeping lock. Thread stack freed in atomic context, therefore we must use vfree_atomic() here. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Andrey Ryabinin <[email protected]> Signed-off-by: Christoph Hellwig <[email protected]> Cc: Joel Fernandes <[email protected]> Cc: Jisheng Zhang <[email protected]> Cc: Chris Wilson <[email protected]> Cc: John Dias <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: "H. Peter Anvin" <[email protected]> Cc: Ingo Molnar <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2016-12-12	mm: add vfree_atomic()	Andrey Ryabinin	2	-6/+37
	We are going to use sleeping lock for freeing vmap. However some vfree() users want to free memory from atomic (but not from interrupt) context. For this we add vfree_atomic() - deferred variation of vfree() which can be used in any atomic context (except NMIs). [[email protected]: tweak comment grammar] [[email protected]: use raw_cpu_ptr() instead of this_cpu_ptr()] Link: http://lkml.kernel.org/r/[email protected] Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Andrey Ryabinin <[email protected]> Signed-off-by: Christoph Hellwig <[email protected]> Cc: Joel Fernandes <[email protected]> Cc: Jisheng Zhang <[email protected]> Cc: Chris Wilson <[email protected]> Cc: John Dias <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: "H. Peter Anvin" <[email protected]> Cc: Ingo Molnar <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2016-12-12	mm: refactor __purge_vmap_area_lazy()	Christoph Hellwig	1	-45/+35
	Move the purge_lock synchronization to the callers, move the call to purge_fragmented_blocks_allcpus at the beginning of the function to the callers that need it, move the force_flush behavior to the caller that needs it, and pass start and end by value instead of by reference. No change in behavior. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Christoph Hellwig <[email protected]> Tested-by: Jisheng Zhang <[email protected]> Cc: Andrey Ryabinin <[email protected]> Cc: Joel Fernandes <[email protected]> Cc: Chris Wilson <[email protected]> Cc: John Dias <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: "H. Peter Anvin" <[email protected]> Cc: Ingo Molnar <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2016-12-12	mm: remove free_unmap_vmap_area_addr()	Christoph Hellwig	1	-13/+8
	Just inline it into the only caller. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Christoph Hellwig <[email protected]> Tested-by: Jisheng Zhang <[email protected]> Cc: Andrey Ryabinin <[email protected]> Cc: Joel Fernandes <[email protected]> Cc: Chris Wilson <[email protected]> Cc: John Dias <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: "H. Peter Anvin" <[email protected]> Cc: Ingo Molnar <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2016-12-12	mm: remove free_unmap_vmap_area_noflush()	Christoph Hellwig	1	-11/+2
	Patch series "reduce latency in __purge_vmap_area_lazy", v2. This patch (of 10): Sort out the long lock hold times in __purge_vmap_area_lazy. It is based on a patch from Joel. Inline free_unmap_vmap_area_noflush() it into the only caller. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Christoph Hellwig <[email protected]> Tested-by: Jisheng Zhang <[email protected]> Cc: Andrey Ryabinin <[email protected]> Cc: Joel Fernandes <[email protected]> Cc: Chris Wilson <[email protected]> Cc: John Dias <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: "H. Peter Anvin" <[email protected]> Cc: Ingo Molnar <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2016-12-12	mm: workingset: update shadow limit to reflect bigger active list	Johannes Weiner	1	-19/+25
	Since commit 59dc76b0d4df ("mm: vmscan: reduce size of inactive file list") the size of the active file list is no longer limited to half of memory. Increase the shadow node limit accordingly to avoid throwing out shadow entries that might still result in eligible refaults. The exact size of the active list now depends on the overall size of the page cache, but converges toward taking up most of the space: In mm/vmscan.c::inactive_list_is_low(), * total target max * memory ratio inactive * ------------------------------------- * 10MB 1 5MB * 100MB 1 50MB * 1GB 3 250MB * 10GB 10 0.9GB * 100GB 31 3GB * 1TB 101 10GB * 10TB 320 32GB It would be possible to apply the same precise ratios when determining the limit for radix tree nodes containing shadow entries, but since it is merely an approximation of the oldest refault distances in the wild and the code also makes assumptions about the node population density, keep it simple and always target the full cache size. While at it, clarify the comment and the formula for memory footprint. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Johannes Weiner <[email protected]> Cc: Rik van Riel <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2016-12-12	mm: workingset: restore refault tracking for single-page files	Johannes Weiner	1	-8/+1
	Shadow entries in the page cache used to be accounted behind the radix tree implementation's back in the upper bits of node->count, and the radix tree code extending a single-entry tree with a shadow entry in root->rnode would corrupt that counter. As a result, we could not put shadow entries at index 0 if the tree didn't have any other entries, and that means no refault detection for any single-page file. Now that the shadow entries are tracked natively in the radix tree's exceptional counter, this is no longer necessary. Extending and shrinking the tree from and to single entries in root->rnode now does the right thing when the entry is exceptional, remove that limitation. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Johannes Weiner <[email protected]> Reviewed-by: Jan Kara <[email protected]> Cc: Kirill A. Shutemov <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Matthew Wilcox <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2016-12-12	mm: workingset: move shadow entry tracking to radix tree exceptional tracking	Johannes Weiner	6	-138/+60
	Currently, we track the shadow entries in the page cache in the upper bits of the radix_tree_node->count, behind the back of the radix tree implementation. Because the radix tree code has no awareness of them, we rely on random subtleties throughout the implementation (such as the node->count != 1 check in the shrinking code, which is meant to exclude multi-entry nodes but also happens to skip nodes with only one shadow entry, as that's accounted in the upper bits). This is error prone and has, in fact, caused the bug fixed in d3798ae8c6f3 ("mm: filemap: don't plant shadow entries without radix tree node"). To remove these subtleties, this patch moves shadow entry tracking from the upper bits of node->count to the existing counter for exceptional entries. node->count goes back to being a simple counter of valid entries in the tree node and can be shrunk to a single byte. This vastly simplifies the page cache code. All accounting happens natively inside the radix tree implementation, and maintaining the LRU linkage of shadow nodes is consolidated into a single function in the workingset code that is called for leaf nodes affected by a change in the page cache tree. This also removes the last user of the __radix_delete_node() return value. Eliminate it. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Johannes Weiner <[email protected]> Reviewed-by: Jan Kara <[email protected]> Cc: Kirill A. Shutemov <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Matthew Wilcox <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2016-12-12	lib: radix-tree: update callback for changing leaf nodes	Johannes Weiner	4	-16/+36
	Support handing __radix_tree_replace() a callback that gets invoked for all leaf nodes that change or get freed as a result of the slot replacement, to assist users tracking nodes with node->private_list. This prepares for putting page cache shadow entries into the radix tree root again and drastically simplifying the shadow tracking. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Johannes Weiner <[email protected]> Suggested-by: Jan Kara <[email protected]> Reviewed-by: Jan Kara <[email protected]> Cc: Kirill A. Shutemov <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Matthew Wilcox <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2016-12-12	lib: radix-tree: add entry deletion support to __radix_tree_replace()	Johannes Weiner	1	-111/+116
	Page cache shadow entry handling will be a lot simpler when it can use a single generic replacement function for pages, shadow entries, and emptying slots. Make __radix_tree_replace() properly account insertions and deletions in node->count and garbage collect nodes as they become empty. Then re-implement radix_tree_delete() on top of it. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Johannes Weiner <[email protected]> Reviewed-by: Jan Kara <[email protected]> Cc: Kirill A. Shutemov <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Matthew Wilcox <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2016-12-12	lib: radix-tree: check accounting of existing slot replacement users	Johannes Weiner	10	-40/+64
	The bug in khugepaged fixed earlier in this series shows that radix tree slot replacement is fragile; and it will become more so when not only NULL<->!NULL transitions need to be caught but transitions from and to exceptional entries as well. We need checks. Re-implement radix_tree_replace_slot() on top of the sanity-checked __radix_tree_replace(). This requires existing callers to also pass the radix tree root, but it'll warn us when somebody replaces slots with contents that need proper accounting (transitions between NULL entries, real entries, exceptional entries) and where a replacement through the slot pointer would corrupt the radix tree node counts. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Johannes Weiner <[email protected]> Suggested-by: Jan Kara <[email protected]> Reviewed-by: Jan Kara <[email protected]> Cc: Kirill A. Shutemov <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Matthew Wilcox <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2016-12-12	lib: radix-tree: native accounting of exceptional entries	Johannes Weiner	4	-12/+57
	The way the page cache is sneaking shadow entries of evicted pages into the radix tree past the node entry accounting and tracking them manually in the upper bits of node->count is fraught with problems. These shadow entries are marked in the tree as exceptional entries, which are a native concept to the radix tree. Maintain an explicit counter of exceptional entries in the radix tree node. Subsequent patches will switch shadow entry tracking over to that counter. DAX and shmem are the other users of exceptional entries. Since slot replacements that change the entry type from regular to exceptional must now be accounted, introduce a __radix_tree_replace() function that does replacement and accounting, and switch DAX and shmem over. The increase in radix tree node size is temporary. A followup patch switches the shadow tracking to this new scheme and we'll no longer need the upper bits in node->count and shrink that back to one byte. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Johannes Weiner <[email protected]> Reviewed-by: Jan Kara <[email protected]> Cc: Kirill A. Shutemov <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Matthew Wilcox <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2016-12-12	mm: workingset: turn shadow node shrinker bugs into warnings	Johannes Weiner	1	-8/+12
	When the shadow page shrinker tries to reclaim a radix tree node but finds it in an unexpected state - it should contain no pages, and non-zero shadow entries - there is no need to kill the executing task or even the entire system. Warn about the invalid state, then leave that tree node be. Simply don't put it back on the shadow LRU for future reclaim and move on. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Johannes Weiner <[email protected]> Reviewed-by: Jan Kara <[email protected]> Cc: Kirill A. Shutemov <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Matthew Wilcox <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2016-12-12	mm: khugepaged: fix radix tree node leak in shmem collapse error path	Johannes Weiner	1	-2/+4
	The radix tree counts valid entries in each tree node. Entries stored in the tree cannot be removed by simpling storing NULL in the slot or the internal counters will be off and the node never gets freed again. When collapsing a shmem page fails, restore the holes that were filled with radix_tree_insert() with a proper radix tree deletion. Fixes: f3f0e1d2150b ("khugepaged: add support of collapse for tmpfs/shmem pages") Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Johannes Weiner <[email protected]> Reported-by: Jan Kara <[email protected]> Acked-by: Kirill A. Shutemov <[email protected]> Reviewed-by: Jan Kara <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Matthew Wilcox <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2016-12-12	mm: khugepaged: close use-after-free race during shmem collapsing	Johannes Weiner	1	-0/+5
	Patch series "mm: workingset: radix tree subtleties & single-page file refaults", v3. This is another revision of the radix tree / workingset patches based on feedback from Jan and Kirill. This is a follow-up to d3798ae8c6f3 ("mm: filemap: don't plant shadow entries without radix tree node"). That patch fixed an issue that was caused mainly by the page cache sneaking special shadow page entries into the radix tree and relying on subtleties in the radix tree code to make that work. The fix also had to stop tracking refaults for single-page files because shadow pages stored as direct pointers in radix_tree_root->rnode weren't properly handled during tree extension. These patches make the radix tree code explicitely support and track such special entries, to eliminate the subtleties and to restore the thrash detection for single-page files. This patch (of 9): When a radix tree iteration drops the tree lock, another thread might swoop in and free the node holding the current slot. The iteration needs to do another tree lookup from the current index to continue. [[email protected]: re-lookup for replacement] Fixes: f3f0e1d2150b ("khugepaged: add support of collapse for tmpfs/shmem pages") Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Johannes Weiner <[email protected]> Acked-by: Kirill A. Shutemov <[email protected]> Reviewed-by: Jan Kara <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Matthew Wilcox <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2016-12-12	include/linux/backing-dev-defs.h: shrink struct backing_dev_info	Andrew Morton	1	-1/+1
	Move the 4-byte `capabilities' field next to other 4-byte things. Shrinks sizeof(backing_dev_info) by 8 bytes on x86_64. Reviewed-by: Jens Axboe <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2016-12-12	mm: don't cap request size based on read-ahead setting	Jens Axboe	4	-11/+31
	We ran into a funky issue, where someone doing 256K buffered reads saw 128K requests at the device level. Turns out it is read-ahead capping the request size, since we use 128K as the default setting. This doesn't make a lot of sense - if someone is issuing 256K reads, they should see 256K reads, regardless of the read-ahead setting, if the underlying device can support a 256K read in a single command. This patch introduces a bdi hint, io_pages. This is the soft max IO size for the lower level, I've hooked it up to the bdev settings here. Read-ahead is modified to issue the maximum of the user request size, and the read-ahead max size, but capped to the max request size on the device side. The latter is done to avoid reading ahead too much, if the application asks for a huge read. With this patch, the kernel behaves like the application expects. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]> Acked-by: Johannes Weiner <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2016-12-12	shmem: fix compilation warnings on unused functions	Jérémy Lefaure	1	-0/+2
	Compiling shmem.c with SHMEM and TRANSAPRENT_HUGE_PAGECACHE enabled raises warnings on two unused functions when CONFIG_TMPFS and CONFIG_SYSFS are both disabled: mm/shmem.c:390:20: warning: `shmem_format_huge' defined but not used [-Wunused-function] static const char shmem_format_huge(int huge) ^~~~~~~~~~~~~~~~~ mm/shmem.c:373:12: warning: `shmem_parse_huge' defined but not used [-Wunused-function] static int shmem_parse_huge(const char str) ^~~~~~~~~~~~~~~~ A conditional compilation on tmpfs or sysfs removes the warnings. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Jérémy Lefaure <[email protected]> Acked-by: Hugh Dickins <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2016-12-12	fs/fs-writeback.c: remove redundant if check	Tahsin Erdogan	1	-9/+7
	b_more_io non-empty check is already preceded by an opposite check. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Tahsin Erdogan <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2016-12-12	mm/filemap.c: add comment for confusing logic in page_cache_tree_insert()	Kirill A. Shutemov	1	-1/+4
	Unlike THP, hugetlb pages are represented by one entry in the radix-tree. [[email protected]: tweak comment] Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Kirill A. Shutemov <[email protected]> Acked-by: Johannes Weiner <[email protected]> Cc: Jan Kara <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>