Age | Commit message (Collapse) | Author | Files | Lines |
|
Commit 4bcc595ccd80 ("printk: reinstate KERN_CONT for printing
continuation lines") added back KERN_CONT message header. As a result
it might appear in the middle of the line when the parts are squashed
via the temporary NMI buffer.
A reasonable solution seems to be to split the text in the NNI temporary
not only by newlines but also by the message headers.
Another solution would be to filter out KERN_CONT when writing to the
temporary buffer. But this would complicate the lockless handling.
Also it would not solve problems with a missing newline that was there
even before the KERN_CONT stuff.
This patch moves the temporary buffer handling into separate function.
I played with it and it seems that using the char pointers make the code
easier to read.
Also it prints the final newline as a continuous line.
Finally, it moves handling of the s->len overflow into the paranoid
check. And allows to recover from the disaster.
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Petr Mladek <[email protected]>
Reviewed-by: Sergey Senozhatsky <[email protected]>
Cc: Joe Perches <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Jason Wessel <[email protected]>
Cc: Jaroslav Kysela <[email protected]>
Cc: Takashi Iwai <[email protected]>
Cc: Chris Mason <[email protected]>
Cc: Josef Bacik <[email protected]>
Cc: David Sterba <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
vsnprintf() adds the trailing '\0' but it does not count it into the
number of printed characters. The result is that there is one byte less
space for the real characters in the buffer.
The broken check for the free space might cause that we will repeatedly
try to print 1 character into the buffer, never reach the full buffer,
and do not count the messages as missed.
Also vsnprintf() returns the number of characters that would be printed
if the buffer was big enough. As a result, s->len might be bigger than
the size of the buffer[*]. And the printk() function might return
bigger len than it really printed. Both problems are fixed by using
vscnprintf() instead.
Note that I though about increasing the number of missed messages even
when the message was shrunken. But it made the code even more
complicated. I think that it is not worth it. Shrunken messages are
usually easy to recognize. And it should be a corner case.
[*] The overflown s->len value is crazy and unexpected. I "made a
mistake" and reported this situation as an internal error when fixed
handling of PR_CONT headers in some other patch.
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Petr Mladek <[email protected]>
CcL Sergey Senozhatsky <[email protected]>
Cc: Chris Mason <[email protected]>
Cc: David Sterba <[email protected]>
Cc: Jason Wessel <[email protected]>
Cc: Josef Bacik <[email protected]>
Cc: Joe Perches <[email protected]>
Cc: Jaroslav Kysela <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Takashi Iwai <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Link: http://lkml.kernel.org/r/1477894241.1103202.772260161.1B0A5995@webmail.messagingengine.com
Signed-off-by: Benjamin Peterson <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Since sysctl_hung_task_warnings == -1 is allowed (infinite warnings),
commit 48a6d64edadb ("hung_task: allow hung_task_panic when
hung_task_warnings is 0") should decrement it only when it is not -1.
This prevents the kernel from ceasing warnings after the first
4294967295 ;)
Signed-off-by: Tetsuo Handa <[email protected]>
Cc: John Siddle <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Runtime nlink calculation works but meh. I don't know how to do it at
compile time, but I know how to do it at init time.
Shift "2+" part into init time as a bonus.
Link: http://lkml.kernel.org/r/20161122195549.GB29812@avx2
Signed-off-by: Alexey Dobriyan <[email protected]>
Reviewed-by: Vegard Nossum <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Comparison for "<" works equally well as comparison for "<=" but one
SUB/LEA is saved (no, it is not optimised away, at least here).
Link: http://lkml.kernel.org/r/20161122195143.GA29812@avx2
Signed-off-by: Alexey Dobriyan <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
format_decode and vsnprintf occasionally show up in perf top, so I went
looking for places that might not need the full printf power. With the
help of kprobes, I gathered some statistics on which format strings we
mostly pass to vsnprintf. On a trivial desktop workload, I hit "%x" 25%
of the time, so something apparently reads /proc/pid/status (which does
5*16 printf("%x") calls) a lot.
With this patch, reading /proc/pid/status is 30% faster according to
this microbenchmark:
char buf[4096];
int i, fd;
for (i = 0; i < 10000; ++i) {
fd = open("/proc/self/status", O_RDONLY);
read(fd, buf, sizeof(buf));
close(fd);
}
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Rasmus Villemoes <[email protected]>
Acked-by: Andrei Vagin <[email protected]>
Acked-by: Kees Cook <[email protected]>
Cc: Alexey Dobriyan <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Some comments were obsoleted since commit 05c0ae21c034 ("try a saner
locking for pde_opener...").
Some new comments added.
Some confusing comments replaced with equally confusing ones.
Link: http://lkml.kernel.org/r/20161029160231.GD1246@avx2
Signed-off-by: Alexey Dobriyan <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
kzalloc is too much, half of the fields will be reinitialized anyway.
If proc file doesn't have ->release hook (some still do not), clearing
is unnecessary because it will be freed immediately.
Link: http://lkml.kernel.org/r/20161029155747.GC1246@avx2
Signed-off-by: Alexey Dobriyan <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
struct pde_opener::closing is boolean.
Link: http://lkml.kernel.org/r/20161029155439.GB1246@avx2
Signed-off-by: Alexey Dobriyan <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
list_del_init() is too much, structure will be freed in three lines
anyway.
Link: http://lkml.kernel.org/r/20161029155313.GA1246@avx2
Signed-off-by: Alexey Dobriyan <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Linux doesn't support 4GB+ filenames in /proc, so unsigned long is too
much.
MOV r64, r/m64 is larger than MOV r32, r/m32.
Link: http://lkml.kernel.org/r/20161029161123.GG1246@avx2
Signed-off-by: Alexey Dobriyan <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
"unsigned int" is better on x86_64 because it most of the time it
autoexpands to 64-bit value while "int" requires MOVSX instruction.
Link: http://lkml.kernel.org/r/20161029160810.GF1246@avx2
Signed-off-by: Alexey Dobriyan <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Similar to being able to examine if a process has been correctly
confined with seccomp, the state of no_new_privs is equally interesting,
so this adds it to /proc/$pid/status.
Link: http://lkml.kernel.org/r/20161103214041.GA58566@beast
Signed-off-by: Kees Cook <[email protected]>
Reviewed-by: Jann Horn <[email protected]>
Cc: Jonathan Corbet <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Konstantin Khlebnikov <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Naoya Horiguchi <[email protected]>
Cc: Rodrigo Freire <[email protected]>
Cc: John Stultz <[email protected]>
Cc: Ross Zwisler <[email protected]>
Cc: Robert Ho <[email protected]>
Cc: Jerome Marchand <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Alexey Dobriyan <[email protected]>
Cc: "Richard W.M. Jones" <[email protected]>
Cc: Joe Perches <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
As shown by pcpu_build_alloc_info(), the number of units within a percpu
group is deduced by rounding up the number of CPUs within the group to
@upa boundary/ Therefore, the number of CPUs isn't equal to the units's
if it isn't aligned to @upa normally. However, pcpu_page_first_chunk()
uses BUG_ON() to assert that one number is equal to the other roughly,
so a panic is maybe triggered by the BUG_ON() incorrectly.
In order to fix this issue, the number of CPUs is rounded up then
compared with units's and the BUG_ON() is replaced with a warning and
return of an error code as well, to keep system alive as much as
possible.
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: zijun_hu <[email protected]>
Cc: Tejun Heo <[email protected]>
Cc: Christoph Lameter <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
In the upcoming gcc7 release, the -fsanitize=kernel-address option at
first implied new -fsanitize-address-use-after-scope option. This would
cause link errors on older kernels because they don't have two new
functions required for use-after-scope support. Therefore, gcc7 changed
default to -fno-sanitize-address-use-after-scope.
Now the kernel has everything required for that feature since commit
828347f8f9a5 ("kasan: support use-after-scope detection"). So, to make it
work, we just have to enable use-after-scope in CFLAGS.
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Andrey Ryabinin <[email protected]>
Acked-by: Dmitry Vyukov <[email protected]>
Cc: Alexander Potapenko <[email protected]>
Cc: Andrey Konovalov <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Currently we dedicate 1/32 of RAM for quarantine and then reduce it by
1/4 of total quarantine size. This can be a significant amount of
memory. For example, with 4GB of RAM total quarantine size is 128MB and
it is reduced by 32MB at a time. With 128GB of RAM total quarantine
size is 4GB and it is reduced by 1GB. This leads to several problems:
- freeing 1GB can take tens of seconds, causes rcu stall warnings and
just introduces unexpected long delays at random places
- if kmalloc() is called under a mutex, other threads stall on that
mutex while a thread reduces quarantine
- threads wait on quarantine_lock while one thread grabs a large batch
of objects to evict
- we walk the uncached list of object to free twice which makes all of
the above worse
- when a thread frees objects, they are already not accounted against
global_quarantine.bytes; as the result we can have quarantine_size
bytes in quarantine + unbounded amount of memory in large batches in
threads that are in process of freeing
Reduce size of quarantine in smaller batches to reduce the delays. The
only reason to reduce it in batches is amortization of overheads, the
new batch size of 1MB should be well enough to amortize spinlock
lock/unlock and few function calls.
Plus organize quarantine as a FIFO array of batches. This allows to not
walk the list in quarantine_reduce() under quarantine_lock, which in
turn reduces contention and is just faster.
This improves performance of heavy load (syzkaller fuzzing) by ~20% with
4 CPUs and 32GB of RAM. Also this eliminates frequent (every 5 sec)
drops of CPU consumption from ~400% to ~100% (one thread reduces
quarantine while others are waiting on a mutex).
Some reference numbers:
1. Machine with 4 CPUs and 4GB of memory. Quarantine size 128MB.
Currently we free 32MB at at time.
With new code we free 1MB at a time (1024 batches, ~128 are used).
2. Machine with 32 CPUs and 128GB of memory. Quarantine size 4GB.
Currently we free 1GB at at time.
With new code we free 8MB at a time (1024 batches, ~512 are used).
3. Machine with 4096 CPUs and 1TB of memory. Quarantine size 32GB.
Currently we free 8GB at at time.
With new code we free 4MB at a time (16K batches, ~8K are used).
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Dmitry Vyukov <[email protected]>
Cc: Eric Dumazet <[email protected]>
Cc: Greg Thelen <[email protected]>
Cc: Alexander Potapenko <[email protected]>
Cc: Andrey Ryabinin <[email protected]>
Cc: Andrey Konovalov <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
If user sets panic_on_warn, he wants kernel to panic if there is
anything barely wrong with the kernel. KASAN-detected errors are
definitely not less benign than an arbitrary kernel WARNING.
Panic after KASAN errors if panic_on_warn is set.
We use this for continuous fuzzing where we want kernel to stop and
reboot on any error.
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Dmitry Vyukov <[email protected]>
Acked-by: Andrey Ryabinin <[email protected]>
Cc: Alexander Potapenko <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Test programs want to know the size of a transparent hugepage. While it
is commonly the same as the size of a hugetlbfs page (shown as
Hugepagesize in /proc/meminfo), that is not always so: powerpc
implements transparent hugepages in a different way from hugetlbfs
pages, so it's coincidence when their sizes are the same; and x86 and
others can support more than one hugetlbfs page size.
Add /sys/kernel/mm/transparent_hugepage/hpage_pmd_size to show the THP
size in bytes - it's the same for Anonymous and Shmem hugepages. Call
it hpage_pmd_size (after HPAGE_PMD_SIZE) rather than hpage_size, in case
some transparent support for pud and pgd pages is added later.
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Hugh Dickins <[email protected]>
Acked-by: Michal Hocko <[email protected]>
Acked-by: Vlastimil Babka <[email protected]>
Cc: Greg Thelen <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: "Kirill A. Shutemov" <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Cc: "Aneesh Kumar K.V" <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Dan Williams <[email protected]>
Cc: Jan Kara <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
The other pagetable walks in task_mmu.c have a cond_resched() after
walking their ptes: add a cond_resched() in gather_pte_stats() too, for
reading /proc/<id>/numa_maps. Only pagemap_pmd_range() has a
cond_resched() in its (unusually expensive) pmd_trans_huge case: more
should probably be added, but leave them unchanged for now.
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Hugh Dickins <[email protected]>
Acked-by: Michal Hocko <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: Gerald Schaefer <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Add a cond_resched() in the unuse_pmd_range() loop (so as to call it
even when pmd none or trans_huge, like zap_pmd_range() does); and in the
unuse_mm() loop (since that might skip over many vmas). shmem_unuse()
and radix_tree_locate_item() look good enough already.
Those were the obvious places, but in fact the stalls came from
find_next_to_unuse(), which sometimes scans through many unused entries.
Apply scan_swap_map()'s LATENCY_LIMIT of 256 there too; and only go off
to test frontswap_map when a used entry is found.
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Hugh Dickins <[email protected]>
Reported-by: Eric Dumazet <[email protected]>
Cc: David Rientjes <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
corrupted
Vlastimil Babka pointed out that commit 479f854a207c ("mm, page_alloc:
defer debugging checks of pages allocated from the PCP") will allow the
per-cpu list counter to be out of sync with the per-cpu list contents if
a struct page is corrupted.
The consequence is an infinite loop if the per-cpu lists get fully
drained by free_pcppages_bulk because all the lists are empty but the
count is positive. The infinite loop occurs here
do {
batch_free++;
if (++migratetype == MIGRATE_PCPTYPES)
migratetype = 0;
list = &pcp->lists[migratetype];
} while (list_empty(list));
What the user sees is a bad page warning followed by a soft lockup with
interrupts disabled in free_pcppages_bulk().
This patch keeps the accounting in sync.
Fixes: 479f854a207c ("mm, page_alloc: defer debugging checks of pages allocated from the PCP")
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Mel Gorman <[email protected]>
Acked-by: Vlastimil Babka <[email protected]>
Acked-by: Michal Hocko <[email protected]>
Acked-by: Hillf Danton <[email protected]>
Cc: Christoph Lameter <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Jesper Dangaard Brouer <[email protected]>
Cc: Joonsoo Kim <[email protected]>
Cc: <[email protected]> [4.7+]
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
anon_vma_prepare() is mostly a large "if (unlikely(...))" block, as the
expected common case is that an anon_vma already exists. We could turn
the condition around and return 0, but it also makes sense to do it
inline and avoid a call for the common case.
Bloat-o-meter naturally shows that inlining the check has some code size
costs:
add/remove: 1/1 grow/shrink: 4/0 up/down: 475/-373 (102)
function old new delta
__anon_vma_prepare - 359 +359
handle_mm_fault 2744 2796 +52
hugetlb_cow 1146 1170 +24
hugetlb_fault 2123 2145 +22
wp_page_copy 1469 1487 +18
anon_vma_prepare 373 - -373
Checking the asm however confirms that the hot paths now avoid a call,
which is moved away.
[[email protected]: coding-style fixes]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Vlastimil Babka <[email protected]>
Cc: "Kirill A. Shutemov" <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Konstantin Khlebnikov <[email protected]>
Cc: Rik van Riel <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
__dump_page() is used when a page metadata inconsistency is detected,
either by standard runtime checks, or extra checks in CONFIG_DEBUG_VM
builds. It prints some of the relevant metadata, but not the whole
struct page, which is based on unions and interpretation is dependent on
the context.
This means that sometimes e.g. a VM_BUG_ON_PAGE() checks certain field,
which is however not printed by __dump_page() and the resulting bug
report may then lack clues that could help in determining the root
cause. This patch solves the problem by simply printing the whole
struct page word by word, so no part is missing, but the interpretation
of the data is left to developers. This is similar to e.g. x86_64 raw
stack dumps.
Example output:
page:ffffea00000475c0 count:1 mapcount:0 mapping: (null) index:0x0
flags: 0x100000000000400(reserved)
raw: 0100000000000400 0000000000000000 0000000000000000 00000001ffffffff
raw: ffffea00000475e0 ffffea00000475e0 0000000000000000 0000000000000000
page dumped because: VM_BUG_ON_PAGE(1)
[[email protected]: suggested print_hex_dump()]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Vlastimil Babka <[email protected]>
Acked-by: Kirill A. Shutemov <[email protected]>
Acked-by: Andrey Ryabinin <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Add arch specific callback in the generic THP page cache code that will
deposit and withdarw preallocated page table. Archs like ppc64 use this
preallocated table to store the hash pte slot information.
Testing:
kernel build of the patch series on tmpfs mounted with option huge=always
The related thp stat:
thp_fault_alloc 72939
thp_fault_fallback 60547
thp_collapse_alloc 603
thp_collapse_alloc_failed 0
thp_file_alloc 253763
thp_file_mapped 4251
thp_split_page 51518
thp_split_page_failed 1
thp_deferred_split_page 73566
thp_split_pmd 665
thp_zero_page_alloc 3
thp_zero_page_alloc_failed 0
[[email protected]: remove unneeded parentheses, per Kirill]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Aneesh Kumar K.V <[email protected]>
Acked-by: Kirill A. Shutemov <[email protected]>
Cc: Michael Ellerman <[email protected]>
Cc: Benjamin Herrenschmidt <[email protected]>
Cc: Michael Neuling <[email protected]>
Cc: Paul Mackerras <[email protected]>
Cc: Balbir Singh <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Independent of whether the vma is for anonymous memory, some arches like
ppc64 would like to override pmd_move_must_withdraw().
One option is to encapsulate the vma_is_anonymous() check for general
architectures inside pmd_move_must_withdraw() so that is always called
and architectures that need unconditional overriding can override this
function. ppc64 needs to override the function when the MMU is
configured to use hash PTE's.
[[email protected]: reworked changelog]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Aneesh Kumar K.V <[email protected]>
Acked-by: Kirill A. Shutemov <[email protected]>
Acked-by: Michael Ellerman <[email protected]> (powerpc)
Cc: Benjamin Herrenschmidt <[email protected]>
Cc: Michael Neuling <[email protected]>
Cc: Paul Mackerras <[email protected]>
Cc: Balbir Singh <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Use cond_resched_lock to avoid holding the vmap_area_lock for a
potentially long time and thus creating bad latencies for various
workloads.
[hch: split from a larger patch by Joel, wrote the crappy changelog]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Joel Fernandes <[email protected]>
Signed-off-by: Christoph Hellwig <[email protected]>
Tested-by: Jisheng Zhang <[email protected]>
Cc: Andrey Ryabinin <[email protected]>
Cc: Chris Wilson <[email protected]>
Cc: John Dias <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: "H. Peter Anvin" <[email protected]>
Cc: Ingo Molnar <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
The purge_lock spinlock causes high latencies with non RT kernel. This
has been reported multiple times on lkml [1] [2] and affects
applications like audio.
This patch replaces it with a mutex to allow preemption while holding
the lock.
Thanks to Joel Fernandes for the detailed report and analysis as well as
an earlier attempt at fixing this issue.
[1] http://lists.openwall.net/linux-kernel/2016/03/23/29
[2] https://lkml.org/lkml/2016/10/9/59
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Christoph Hellwig <[email protected]>
Tested-by: Jisheng Zhang <[email protected]>
Cc: Andrey Ryabinin <[email protected]>
Cc: Joel Fernandes <[email protected]>
Cc: Chris Wilson <[email protected]>
Cc: John Dias <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: "H. Peter Anvin" <[email protected]>
Cc: Ingo Molnar <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
We will take a sleeping lock in later in this series, so this adds the
proper safeguards.
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Christoph Hellwig <[email protected]>
Tested-by: Jisheng Zhang <[email protected]>
Cc: Andrey Ryabinin <[email protected]>
Cc: Joel Fernandes <[email protected]>
Cc: Chris Wilson <[email protected]>
Cc: John Dias <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: "H. Peter Anvin" <[email protected]>
Cc: Ingo Molnar <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
vfree() is going to use sleeping lock. free_ldt_struct() may be called
with disabled preemption, therefore we must use vfree_atomic() here.
E.g. call trace:
vfree()
free_ldt_struct()
destroy_context_ldt()
__mmdrop()
finish_task_switch()
schedule_tail()
ret_from_fork()
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Andrey Ryabinin <[email protected]>
Signed-off-by: Christoph Hellwig <[email protected]>
Cc: Joel Fernandes <[email protected]>
Cc: Jisheng Zhang <[email protected]>
Cc: Chris Wilson <[email protected]>
Cc: John Dias <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: "H. Peter Anvin" <[email protected]>
Cc: Ingo Molnar <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
vfree() is going to use sleeping lock. Thread stack freed in atomic
context, therefore we must use vfree_atomic() here.
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Andrey Ryabinin <[email protected]>
Signed-off-by: Christoph Hellwig <[email protected]>
Cc: Joel Fernandes <[email protected]>
Cc: Jisheng Zhang <[email protected]>
Cc: Chris Wilson <[email protected]>
Cc: John Dias <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: "H. Peter Anvin" <[email protected]>
Cc: Ingo Molnar <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
We are going to use sleeping lock for freeing vmap. However some
vfree() users want to free memory from atomic (but not from interrupt)
context. For this we add vfree_atomic() - deferred variation of vfree()
which can be used in any atomic context (except NMIs).
[[email protected]: tweak comment grammar]
[[email protected]: use raw_cpu_ptr() instead of this_cpu_ptr()]
Link: http://lkml.kernel.org/r/[email protected]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Andrey Ryabinin <[email protected]>
Signed-off-by: Christoph Hellwig <[email protected]>
Cc: Joel Fernandes <[email protected]>
Cc: Jisheng Zhang <[email protected]>
Cc: Chris Wilson <[email protected]>
Cc: John Dias <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: "H. Peter Anvin" <[email protected]>
Cc: Ingo Molnar <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Move the purge_lock synchronization to the callers, move the call to
purge_fragmented_blocks_allcpus at the beginning of the function to the
callers that need it, move the force_flush behavior to the caller that
needs it, and pass start and end by value instead of by reference.
No change in behavior.
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Christoph Hellwig <[email protected]>
Tested-by: Jisheng Zhang <[email protected]>
Cc: Andrey Ryabinin <[email protected]>
Cc: Joel Fernandes <[email protected]>
Cc: Chris Wilson <[email protected]>
Cc: John Dias <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: "H. Peter Anvin" <[email protected]>
Cc: Ingo Molnar <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Just inline it into the only caller.
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Christoph Hellwig <[email protected]>
Tested-by: Jisheng Zhang <[email protected]>
Cc: Andrey Ryabinin <[email protected]>
Cc: Joel Fernandes <[email protected]>
Cc: Chris Wilson <[email protected]>
Cc: John Dias <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: "H. Peter Anvin" <[email protected]>
Cc: Ingo Molnar <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Patch series "reduce latency in __purge_vmap_area_lazy", v2.
This patch (of 10):
Sort out the long lock hold times in __purge_vmap_area_lazy. It is
based on a patch from Joel.
Inline free_unmap_vmap_area_noflush() it into the only caller.
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Christoph Hellwig <[email protected]>
Tested-by: Jisheng Zhang <[email protected]>
Cc: Andrey Ryabinin <[email protected]>
Cc: Joel Fernandes <[email protected]>
Cc: Chris Wilson <[email protected]>
Cc: John Dias <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: "H. Peter Anvin" <[email protected]>
Cc: Ingo Molnar <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Since commit 59dc76b0d4df ("mm: vmscan: reduce size of inactive file
list") the size of the active file list is no longer limited to half of
memory. Increase the shadow node limit accordingly to avoid throwing
out shadow entries that might still result in eligible refaults.
The exact size of the active list now depends on the overall size of the
page cache, but converges toward taking up most of the space:
In mm/vmscan.c::inactive_list_is_low(),
* total target max
* memory ratio inactive
* -------------------------------------
* 10MB 1 5MB
* 100MB 1 50MB
* 1GB 3 250MB
* 10GB 10 0.9GB
* 100GB 31 3GB
* 1TB 101 10GB
* 10TB 320 32GB
It would be possible to apply the same precise ratios when determining
the limit for radix tree nodes containing shadow entries, but since it
is merely an approximation of the oldest refault distances in the wild
and the code also makes assumptions about the node population density,
keep it simple and always target the full cache size.
While at it, clarify the comment and the formula for memory footprint.
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Johannes Weiner <[email protected]>
Cc: Rik van Riel <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Shadow entries in the page cache used to be accounted behind the radix
tree implementation's back in the upper bits of node->count, and the
radix tree code extending a single-entry tree with a shadow entry in
root->rnode would corrupt that counter. As a result, we could not put
shadow entries at index 0 if the tree didn't have any other entries, and
that means no refault detection for any single-page file.
Now that the shadow entries are tracked natively in the radix tree's
exceptional counter, this is no longer necessary. Extending and
shrinking the tree from and to single entries in root->rnode now does
the right thing when the entry is exceptional, remove that limitation.
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Johannes Weiner <[email protected]>
Reviewed-by: Jan Kara <[email protected]>
Cc: Kirill A. Shutemov <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Currently, we track the shadow entries in the page cache in the upper
bits of the radix_tree_node->count, behind the back of the radix tree
implementation. Because the radix tree code has no awareness of them,
we rely on random subtleties throughout the implementation (such as the
node->count != 1 check in the shrinking code, which is meant to exclude
multi-entry nodes but also happens to skip nodes with only one shadow
entry, as that's accounted in the upper bits). This is error prone and
has, in fact, caused the bug fixed in d3798ae8c6f3 ("mm: filemap: don't
plant shadow entries without radix tree node").
To remove these subtleties, this patch moves shadow entry tracking from
the upper bits of node->count to the existing counter for exceptional
entries. node->count goes back to being a simple counter of valid
entries in the tree node and can be shrunk to a single byte.
This vastly simplifies the page cache code. All accounting happens
natively inside the radix tree implementation, and maintaining the LRU
linkage of shadow nodes is consolidated into a single function in the
workingset code that is called for leaf nodes affected by a change in
the page cache tree.
This also removes the last user of the __radix_delete_node() return
value. Eliminate it.
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Johannes Weiner <[email protected]>
Reviewed-by: Jan Kara <[email protected]>
Cc: Kirill A. Shutemov <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Support handing __radix_tree_replace() a callback that gets invoked for
all leaf nodes that change or get freed as a result of the slot
replacement, to assist users tracking nodes with node->private_list.
This prepares for putting page cache shadow entries into the radix tree
root again and drastically simplifying the shadow tracking.
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Johannes Weiner <[email protected]>
Suggested-by: Jan Kara <[email protected]>
Reviewed-by: Jan Kara <[email protected]>
Cc: Kirill A. Shutemov <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Page cache shadow entry handling will be a lot simpler when it can use a
single generic replacement function for pages, shadow entries, and
emptying slots.
Make __radix_tree_replace() properly account insertions and deletions in
node->count and garbage collect nodes as they become empty. Then
re-implement radix_tree_delete() on top of it.
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Johannes Weiner <[email protected]>
Reviewed-by: Jan Kara <[email protected]>
Cc: Kirill A. Shutemov <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
The bug in khugepaged fixed earlier in this series shows that radix tree
slot replacement is fragile; and it will become more so when not only
NULL<->!NULL transitions need to be caught but transitions from and to
exceptional entries as well. We need checks.
Re-implement radix_tree_replace_slot() on top of the sanity-checked
__radix_tree_replace(). This requires existing callers to also pass the
radix tree root, but it'll warn us when somebody replaces slots with
contents that need proper accounting (transitions between NULL entries,
real entries, exceptional entries) and where a replacement through the
slot pointer would corrupt the radix tree node counts.
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Johannes Weiner <[email protected]>
Suggested-by: Jan Kara <[email protected]>
Reviewed-by: Jan Kara <[email protected]>
Cc: Kirill A. Shutemov <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
The way the page cache is sneaking shadow entries of evicted pages into
the radix tree past the node entry accounting and tracking them manually
in the upper bits of node->count is fraught with problems.
These shadow entries are marked in the tree as exceptional entries,
which are a native concept to the radix tree. Maintain an explicit
counter of exceptional entries in the radix tree node. Subsequent
patches will switch shadow entry tracking over to that counter.
DAX and shmem are the other users of exceptional entries. Since slot
replacements that change the entry type from regular to exceptional must
now be accounted, introduce a __radix_tree_replace() function that does
replacement and accounting, and switch DAX and shmem over.
The increase in radix tree node size is temporary. A followup patch
switches the shadow tracking to this new scheme and we'll no longer need
the upper bits in node->count and shrink that back to one byte.
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Johannes Weiner <[email protected]>
Reviewed-by: Jan Kara <[email protected]>
Cc: Kirill A. Shutemov <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
When the shadow page shrinker tries to reclaim a radix tree node but
finds it in an unexpected state - it should contain no pages, and
non-zero shadow entries - there is no need to kill the executing task or
even the entire system. Warn about the invalid state, then leave that
tree node be. Simply don't put it back on the shadow LRU for future
reclaim and move on.
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Johannes Weiner <[email protected]>
Reviewed-by: Jan Kara <[email protected]>
Cc: Kirill A. Shutemov <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
The radix tree counts valid entries in each tree node. Entries stored
in the tree cannot be removed by simpling storing NULL in the slot or
the internal counters will be off and the node never gets freed again.
When collapsing a shmem page fails, restore the holes that were filled
with radix_tree_insert() with a proper radix tree deletion.
Fixes: f3f0e1d2150b ("khugepaged: add support of collapse for tmpfs/shmem pages")
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Johannes Weiner <[email protected]>
Reported-by: Jan Kara <[email protected]>
Acked-by: Kirill A. Shutemov <[email protected]>
Reviewed-by: Jan Kara <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Patch series "mm: workingset: radix tree subtleties & single-page file
refaults", v3.
This is another revision of the radix tree / workingset patches based on
feedback from Jan and Kirill.
This is a follow-up to d3798ae8c6f3 ("mm: filemap: don't plant shadow
entries without radix tree node"). That patch fixed an issue that was
caused mainly by the page cache sneaking special shadow page entries
into the radix tree and relying on subtleties in the radix tree code to
make that work. The fix also had to stop tracking refaults for
single-page files because shadow pages stored as direct pointers in
radix_tree_root->rnode weren't properly handled during tree extension.
These patches make the radix tree code explicitely support and track
such special entries, to eliminate the subtleties and to restore the
thrash detection for single-page files.
This patch (of 9):
When a radix tree iteration drops the tree lock, another thread might
swoop in and free the node holding the current slot. The iteration
needs to do another tree lookup from the current index to continue.
[[email protected]: re-lookup for replacement]
Fixes: f3f0e1d2150b ("khugepaged: add support of collapse for tmpfs/shmem pages")
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Johannes Weiner <[email protected]>
Acked-by: Kirill A. Shutemov <[email protected]>
Reviewed-by: Jan Kara <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Move the 4-byte `capabilities' field next to other 4-byte things.
Shrinks sizeof(backing_dev_info) by 8 bytes on x86_64.
Reviewed-by: Jens Axboe <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
We ran into a funky issue, where someone doing 256K buffered reads saw
128K requests at the device level. Turns out it is read-ahead capping
the request size, since we use 128K as the default setting. This
doesn't make a lot of sense - if someone is issuing 256K reads, they
should see 256K reads, regardless of the read-ahead setting, if the
underlying device can support a 256K read in a single command.
This patch introduces a bdi hint, io_pages. This is the soft max IO
size for the lower level, I've hooked it up to the bdev settings here.
Read-ahead is modified to issue the maximum of the user request size,
and the read-ahead max size, but capped to the max request size on the
device side. The latter is done to avoid reading ahead too much, if the
application asks for a huge read. With this patch, the kernel behaves
like the application expects.
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Jens Axboe <[email protected]>
Acked-by: Johannes Weiner <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Compiling shmem.c with SHMEM and TRANSAPRENT_HUGE_PAGECACHE enabled
raises warnings on two unused functions when CONFIG_TMPFS and
CONFIG_SYSFS are both disabled:
mm/shmem.c:390:20: warning: `shmem_format_huge' defined but not used [-Wunused-function]
static const char *shmem_format_huge(int huge)
^~~~~~~~~~~~~~~~~
mm/shmem.c:373:12: warning: `shmem_parse_huge' defined but not used [-Wunused-function]
static int shmem_parse_huge(const char *str)
^~~~~~~~~~~~~~~~
A conditional compilation on tmpfs or sysfs removes the warnings.
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Jérémy Lefaure <[email protected]>
Acked-by: Hugh Dickins <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
b_more_io non-empty check is already preceded by an opposite check.
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Tahsin Erdogan <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Unlike THP, hugetlb pages are represented by one entry in the
radix-tree.
[[email protected]: tweak comment]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Kirill A. Shutemov <[email protected]>
Acked-by: Johannes Weiner <[email protected]>
Cc: Jan Kara <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|