|
We need interrupts disabled when calling console_trylock_for_printk()
only so that the cpu id we pass to can_use_console() remains valid (for
everything else console_sem provides all the exclusion we need, and
deadlocks on console_sem due to interrupts are impossible because we use
down_trylock()). However, if we are rescheduled, we are guaranteed to
run on an online cpu, so we can just as easily get the cpu id in
can_use_console().
We lose a bit of performance when we enable interrupts in
vprintk_emit() and then disable them again in console_unlock(), but OTOH
this can somewhat reduce the interrupt latency caused by console_unlock(),
especially since later in the patch series we will want to spin on
console_sem in console_trylock_for_printk().
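A minimal sketch of the idea, assuming can_use_console() now looks up the
cpu itself and that the printk-internal have_callable_console() helper is
available (details are illustrative):

    /*
     * Sketch: with console_sem held we are guaranteed to be running on an
     * online cpu, so the cpu id can be sampled here instead of being
     * passed in from a caller that had to keep interrupts disabled.
     */
    static inline int can_use_console(void)
    {
            return cpu_online(raw_smp_processor_id()) || have_callable_console();
    }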
Signed-off-by: Jan Kara <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
printk calls mutex_acquire() / mutex_release() by hand to instrument
lockdep about console_sem. However, in some corner cases the
instrumentation is missing. Fix the problem by creating helper functions
for locking / unlocking console_sem which take care of the lockdep
instrumentation as well.
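A sketch of what such helpers might look like (console_lock_dep_map and
the exact mutex_acquire()/mutex_release() argument lists are assumptions;
the lockdep annotation API differs between kernel versions):

    #define down_console_sem() do {                                 \
            down(&console_sem);                                     \
            mutex_acquire(&console_lock_dep_map, 0, 0, _RET_IP_);   \
    } while (0)

    static void __up_console_sem(unsigned long ip)
    {
            mutex_release(&console_lock_dep_map, 1, ip);
            up(&console_sem);
    }
    #define up_console_sem() __up_console_sem(_RET_IP_)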
Signed-off-by: Jan Kara <[email protected]>
Reported-by: Fabio Estevam <[email protected]>
Reported-by: Andy Shevchenko <[email protected]>
Tested-by: Fabio Estevam <[email protected]>
Tested-By: Valdis Kletnieks <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Ingo Molnar <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
There's no reason to hold logbuf_lock when entering
console_trylock_for_printk().
The first thing this function does is call down_trylock(console_sem), and
if that fails it immediately unlocks logbuf_lock. So logbuf_lock isn't
needed for that branch. When down_trylock() succeeds, the rest of
console_trylock() is OK without logbuf_lock (it is called without it
from other places), and the only remaining thing in
console_trylock_for_printk() is the can_use_console() call. For that
call, console_sem is enough (it iterates all consoles and checks the
CON_ANYTIME flag).
So drop logbuf_lock before entering console_trylock_for_printk(), which
simplifies the code.
[[email protected]: fix have_callable_console() comment]
Signed-off-by: Jan Kara <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
The comment about the interesting interlocking between logbuf_lock and
console_sem is outdated.
It was added in 2002 by commit a880f45a48be during the conversion of
console_lock to console_sem + logbuf_lock.
At that time release_console_sem() (today's equivalent is
console_unlock()) was indeed using logbuf_lock to avoid races between a
trylock on console_sem in printk() and an unlock of console_sem. However,
these days the interlocking is gone and the races are avoided by
rechecking the logbuf state after releasing console_sem.
Signed-off-by: Jan Kara <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
I wonder whether anyone uses the printk return value, but it is there and
should be counted correctly.
This patch modifies log_store() to return the number of bytes from the
'text' part that were really stored. It also handles the return value in
vprintk_emit().
Note that log_store() is also used in cont_flush(), but we can ignore the
return value there. The function works with characters that were already
counted earlier. In addition, the store can never fail here because the
length of the printed text is limited by the "cont" buffer and "dict" is
NULL.
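A sketch of the caller-side handling in vprintk_emit() (the exact
log_store() argument list is an assumption and varies between kernel
versions):

    /* count only what log_store() actually managed to store */
    printed_len += log_store(facility, level, lflags, 0,
                             dict, dictlen, text, text_len);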
Signed-off-by: Petr Mladek <[email protected]>
Cc: Jan Kara <[email protected]>
Cc: Jiri Kosina <[email protected]>
Cc: Kay Sievers <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
We might want to print at least part of too-long messages and add some
warning for debugging purposes.
The question is how long the shrunken message should be. If we use the
whole buffer, it might get rotated too soon. Let's try to use only 1/4 of
the buffer for now.
Also, shrink the whole dictionary. We do not want to parse it or break it
in the middle of some pair of values. It would not cause any real harm,
but still.
Signed-off-by: Petr Mladek <[email protected]>
Cc: Jan Kara <[email protected]>
Cc: Jiri Kosina <[email protected]>
Cc: Kay Sievers <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
We will want to recompute the message size when shrinking too-long
messages. Let's put the code into a separate function.
The side effect of setting "pad_len" is not nice, but it is worth it to
remove the code duplication. Note that I will probably have one more use
for this function when handling messages in a safe way in NMI context.
This patch does not change the existing behavior.
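A sketch of such a helper, assuming the struct printk_log record layout
and LOG_ALIGN padding of the kernel log buffer (details are illustrative):

    static u32 msg_used_size(u16 text_len, u16 dict_len, u32 *pad_len)
    {
            u32 size;

            size = sizeof(struct printk_log) + text_len + dict_len;
            /* side effect: report the padding needed to keep records aligned */
            *pad_len = (-size) & (LOG_ALIGN - 1);
            size += *pad_len;

            return size;
    }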
Signed-off-by: Petr Mladek <[email protected]>
Cc: Jan Kara <[email protected]>
Cc: Jiri Kosina <[email protected]>
Cc: Kay Sievers <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
There was no check for too-long messages. The check for free space always
passed when first_seq and next_seq were equal. Enough free space was not
guaranteed, though.
log_store() might be called to store messages up to 64kB + 64kB + 16B.
This is the sum of the maximal text_len and dict_len values plus the size
of the printk_log structure.
On the other hand, the minimal size of the main log buffer is currently
4kB and it is enforced only by Kconfig.
The good news is that the usage looks safe right now. log_store() is
called only from vprintk_emit() and cont_flush(). Here the "text" part is
always passed via a static buffer and the length is limited to
LOG_LINE_MAX, which is 1024. The "dict" part is NULL in most cases. The
only exceptions are when vprintk_emit() is called from printk_emit() and
dev_vprintk_emit(). But printk_emit() is currently used only in
devkmsg_writev() and there "dict" is NULL as well. In dev_vprintk_emit(),
"dict" is limited by the static buffer "hdr" of size 128 bytes. It
means that the current maximal printed text is 1024B + 128B + 16B and it
always fits the log buffer.
But it is only a matter of time before someone calls printk_emit() with
unsafe parameters, especially the "dict" one.
This patch adds a check for the free space when the buffer is empty. It
reuses the already existing log_has_space() function but adds an extra
parameter that says whether the buffer is empty. Note that the same
values of "first_idx" and "next_idx" might also mean that the buffer is
full.
If the buffer is empty, we must respect the current position of the
indexes. We cannot reset them to the beginning of the buffer. Otherwise,
the functions reading the buffer would get confused.
The question is what to do when the message is too long. This patch uses
the easiest solution and just ignores the problematic message. Let's do
something better in a followup patch.
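A sketch of the check with the extra "empty" parameter (log_first_idx,
log_next_idx and log_buf_len follow the usual printk ring-buffer naming
and are illustrative):

    static int logbuf_has_space(u32 msg_size, bool empty)
    {
            u32 free;

            if (log_next_idx > log_first_idx || empty)
                    free = max(log_buf_len - log_next_idx, log_first_idx);
            else
                    free = log_first_idx - log_next_idx;

            /* also keep room for the empty header that marks a wrap-around */
            return free >= msg_size + sizeof(struct printk_log);
    }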
Signed-off-by: Petr Mladek <[email protected]>
Cc: Jan Kara <[email protected]>
Cc: Jiri Kosina <[email protected]>
Cc: Kay Sievers <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
The check for free space in the log buffer always passes when "first_seq"
and "next_seq" are equal. In theory, it might cause writing outside of
the log buffer.
Fortunately, the current usage looks safe because the used "text" and
"dict" buffers are quite limited. See the second patch for more details.
Anyway, it is better to be on the safe side and add a check. An easy
solution is done in the 2nd patch and it is improved in the 4th patch.
The 5th patch fixes the computation of the printed message length. The
1st and 3rd patches just do some code refactoring to make the other
patches easier.
This patch (of 5):
Some fixes will be needed in the check for free space. They will be
easier if the code is moved out of the rather long log_store() function.
This patch does not change the existing behavior.
Signed-off-by: Petr Mladek <[email protected]>
Cc: Jan Kara <[email protected]>
Cc: Jiri Kosina <[email protected]>
Cc: Kay Sievers <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Nobody seems to have used it for a long time. Let's drop it.
Signed-off-by: Kirill A. Shutemov <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
sysctl_hung_task_panic has been changed to unsigned int. Use kstrtouint()
instead of the obsolete simple_strtoul().
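A minimal sketch of the conversion in the early-param handler (the handler
name and surrounding context are illustrative):

    static int __init hung_task_panic_setup(char *str)
    {
            int rc = kstrtouint(str, 0, &sysctl_hung_task_panic);

            if (rc)
                    return rc;
            return 1;
    }
    __setup("hung_task_panic=", hung_task_panic_setup);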
Signed-off-by: Fabian Frederick <[email protected]>
Cc: Ingo Molnar <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Also fixes checkpatch warnings on proc_dostring function parameters
Signed-off-by: Fabian Frederick <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Replace obsolete function.
kstrtoint is used as reboot_cpu is an integer.
Signed-off-by: Fabian Frederick <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
[[email protected]: don't overwrite kstrtoull()'s errno]
Signed-off-by: Fabian Frederick <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Tejun Heo <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Signed-off-by: Fabian Frederick <[email protected]>
Cc: Steven Rostedt <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Signed-off-by: Fabian Frederick <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
This patch also fixes one function declaration over 80 characters.
Signed-off-by: Fabian Frederick <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Fix checkpatch warnings about EXPORT_SYMBOL and return()
Signed-off-by: Fabian Frederick <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
- EXPORT_SYMBOL
- typo: unexpectidly->unexpectedly
- function prototype over 80 characters
Signed-off-by: Fabian Frederick <[email protected]>
Cc: Serge Hallyn <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Signed-off-by: Fabian Frederick <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
printk() without a level converted to pr_warn() (if err).
printk() without a level converted to pr_info() (disabling non-boot cpus).
Other printk()s converted to their respective levels.
Signed-off-by: Fabian Frederick <[email protected]>
Cc: "Rafael J. Wysocki" <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Ingo Molnar <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Usually, BUG_ON and friends aren't even evaluated in sparse, but recently
compiletime_assert_atomic_type() was added, and that now results in a
sparse warning every time it is used.
The reason turns out to be the temporary variable: after it, sparse no
longer considers the value to be a constant, which results in a warning
and an error. The error is the more annoying part of this, as it
suppresses any further warnings in the same file, hiding other problems.
Unfortunately the condition cannot simply be expanded out to avoid the
temporary variable, since that breaks compiletime_assert on old versions
of GCC such as GCC 4.2.4, which the latest metag compiler is based on.
Therefore, #ifndef __CHECKER__ out __compiletime_error_fallback(), which
uses the potentially negative-size array to trigger a conditional compiler
error, so that sparse doesn't see it.
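A sketch of the resulting definitions in compiler.h (the exact macro
bodies are illustrative of the approach):

    #ifndef __CHECKER__
    # define __compiletime_error_fallback(condition) \
            do { ((void)sizeof(char[1 - 2 * condition])); } while (0)
    #endif
    #ifndef __compiletime_error_fallback
    # define __compiletime_error_fallback(condition) do { } while (0)
    #endif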
Signed-off-by: James Hogan <[email protected]>
Cc: Johannes Berg <[email protected]>
Cc: Daniel Santos <[email protected]>
Cc: Luciano Coelho <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Paul E. McKenney <[email protected]>
Acked-by: Johannes Berg <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Fix two typos in function comments.
Signed-off-by: Fabian Frederick <[email protected]>
Cc: Al Viro <[email protected]>
Cc: "J. Bruce Fields" <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
...like other filesystems.
Signed-off-by: Fabian Frederick <[email protected]>
Cc: Matthew Garrett <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
sys_sgetmask and sys_ssetmask are obsolete system calls no longer
supported in libc.
This patch replaces the architecture-specific __ARCH_WANT_SYS_SGETMASK
with an expert-mode configuration option. That option is enabled by
default for those architectures.
Signed-off-by: Fabian Frederick <[email protected]>
Cc: Steven Miao <[email protected]>
Cc: Mikael Starvik <[email protected]>
Cc: Jesper Nilsson <[email protected]>
Cc: David Howells <[email protected]>
Cc: Geert Uytterhoeven <[email protected]>
Cc: Michal Simek <[email protected]>
Cc: Ralf Baechle <[email protected]>
Cc: Koichi Yasutake <[email protected]>
Cc: "James E.J. Bottomley" <[email protected]>
Cc: Helge Deller <[email protected]>
Cc: Benjamin Herrenschmidt <[email protected]>
Cc: Paul Mackerras <[email protected]>
Cc: "David S. Miller" <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: "H. Peter Anvin" <[email protected]>
Cc: Greg Ungerer <[email protected]>
Cc: Heiko Carstens <[email protected]>
Cc: Oleg Nesterov <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
zswap_dstmem is a percpu block of memory, which should be allocated using
kmalloc_node(), to get better NUMA locality.
Without it, all the blocks are allocated from a single node.
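A sketch of the per-cpu allocation (the buffer size and the cpu-hotplug
callback context are assumptions based on how zswap sets up its per-cpu
scratch buffers):

    u8 *dst;

    /* allocate the compression scratch buffer on the cpu's own NUMA node */
    dst = kmalloc_node(PAGE_SIZE * 2, GFP_KERNEL, cpu_to_node(cpu));
    if (!dst)
            return NOTIFY_BAD;
    per_cpu(zswap_dstmem, cpu) = dst;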
Signed-off-by: Eric Dumazet <[email protected]>
Acked-by: Seth Jennings <[email protected]>
Acked-by: David Rientjes <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Now we can build zsmalloc as a module because unmap_kernel_range() was
exported.
Signed-off-by: Minchan Kim <[email protected]>
Cc: Nitin Gupta <[email protected]>
Cc: Sergey Senozhatsky <[email protected]>
Cc: Jerome Marchand <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
zsmalloc needs unmap_kernel_range() to be exported in order to be built
as a module. See https://lkml.org/lkml/2013/1/18/487
I didn't send a patch to make unmap_kernel_range() exportable at that time
because zram was staging stuff and I thought exporting a VM function for
staging stuff made no sense.
Now zsmalloc has been promoted. If we can't build zsmalloc as a module, it
means we can't build zram as a module either. Additionally, its buddy
map_vm_area() is already exported, so let's export unmap_kernel_range() to
help its buddy.
Signed-off-by: Minchan Kim <[email protected]>
Cc: Nitin Gupta <[email protected]>
Cc: Sergey Senozhatsky <[email protected]>
Cc: Jerome Marchand <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
According to the calculation, the ZS_SIZE_CLASSES value is 255 on systems
with a 4K page size, not 254. The old value may have forgotten to count
ZS_MIN_ALLOC_SIZE in.
This patch fixes this trivial issue in the comments.
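The arithmetic, assuming the usual zsmalloc constants on a 4K-page system
(ZS_MIN_ALLOC_SIZE = 32, ZS_MAX_ALLOC_SIZE = PAGE_SIZE,
ZS_SIZE_CLASS_DELTA = 16; these values are assumptions):

    /*
     * ZS_SIZE_CLASSES = (ZS_MAX_ALLOC_SIZE - ZS_MIN_ALLOC_SIZE) / ZS_SIZE_CLASS_DELTA + 1
     *                 = (4096 - 32) / 16 + 1
     *                 = 254 + 1
     *                 = 255
     */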
Signed-off-by: Weijie Yang <[email protected]>
Cc: Minchan Kim <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
zbud_alloc() is only called by zswap_frontswap_store() with an unsigned
int len. Change the function parameter and update the >= 0 check
accordingly.
Signed-off-by: Fabian Frederick <[email protected]>
Acked-by: Seth Jennings <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
We want to skip the physical block (PAGE_SIZE) which is only partially
covered by the discard bio, so we check the remaining size and subtract it
if there is a need to go to the next physical block.
The current offset usage in zram_bio_discard is incorrect; it can break
the filesystem on top of it. Consider the following scenario:
On some architecture or config, PAGE_SIZE is 64K, for example, and a
filesystem is set up on the zram disk without PAGE_SIZE alignment. A
discard bio arrives with offset = 4K and size = 72K. Normally, it should
not really discard any physical block, as it only partially covers two
physical blocks. However, with the current offset usage, it will discard
the second physical block and free its memory, which will cause filesystem
breakdown.
This patch corrects the offset usage in zram_bio_discard.
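A sketch of the corrected skip logic at the start of zram_bio_discard()
(variable names follow the description above; treat the exact code as
illustrative):

    if (offset) {
            /* only the tail of the first page is covered; skip that page */
            if (n <= (PAGE_SIZE - offset))
                    return;
            n -= (PAGE_SIZE - offset);
            index++;
    }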
Signed-off-by: Weijie Yang <[email protected]>
Cc: Minchan Kim <[email protected]>
Cc: Nitin Gupta <[email protected]>
Acked-by: Joonsoo Kim <[email protected]>
Cc: Sergey Senozhatsky <[email protected]>
Cc: Bob Liu <[email protected]>
Cc: <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
mem_cgroup_force_empty_list() can iterate a large number of pages on an
lru, and mem_cgroup_move_parent() doesn't return an errno unless certain
criteria are met, none of which indicate that the iteration may be taking
too long.
We have encountered the following stack trace many times indicating
"need_resched set for > 51000020 ns (51 ticks) without schedule", for
example:
scheduler_tick()
<timer irq>
mem_cgroup_move_account+0x4d/0x1d5
mem_cgroup_move_parent+0x8d/0x109
mem_cgroup_reparent_charges+0x149/0x2ba
mem_cgroup_css_offline+0xeb/0x11b
cgroup_offline_fn+0x68/0x16b
process_one_work+0x129/0x350
If this iteration is taking too long, we still need to do cond_resched()
even when an individual page is not busy.
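A minimal sketch of the point (mem_cgroup_move_parent_one() is a
hypothetical stand-in for the per-page move logic; the real loop carries
more state):

    do {
            struct page *page = lru_to_page(list);

            if (!mem_cgroup_move_parent_one(page))  /* hypothetical helper */
                    busy = page;
            /* always give the scheduler a chance, busy page or not */
            cond_resched();
    } while (!list_empty(list));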
[[email protected]: changelog]
Signed-off-by: Hugh Dickins <[email protected]>
Signed-off-by: David Rientjes <[email protected]>
Acked-by: Johannes Weiner <[email protected]>
Acked-by: Michal Hocko <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
The existing description is worded in a way that almost encourages setting
vfs_cache_pressure above 100, possibly way above it.
Users are left in the dark about what this numeric value is - an int? a
percentage? what is the scale?
As a result, we are getting reports of noticeable performance
degradation from users who have set vfs_cache_pressure to ridiculously
high values - because they thought there was no downside to it.
From code inspection it is obvious that this value is treated as a
percentage. This patch changes the text to reflect this fact, and adds a
cautionary paragraph advising against setting vfs_cache_pressure sky high.
Signed-off-by: Denys Vlasenko <[email protected]>
Cc: Alexander Viro <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
SIGBUS(BUS_MCEERR_AO)
Currently the memory error handler handles action-optional errors in a
deferred manner by default. And if a recovery-aware application wants
to handle them immediately, it can do so by setting the PF_MCE_EARLY flag.
However, such a signal can be sent only to the main thread, so it's
problematic if the application wants to have a dedicated thread to
handle such signals.
So this patch adds dedicated thread support to the memory error handler.
We have PF_MCE_EARLY flags for each thread separately, so with this patch
the AO signal is sent to the thread with the PF_MCE_EARLY flag set, not to
the main thread. If you want to implement a dedicated thread, you call
prctl() to set PF_MCE_EARLY on that thread (see the sketch below).
The memory error handler collects processes to be killed, so this patch
lets it check the PF_MCE_EARLY flag on each thread in the collecting
routines.
No behavioral change for all non-early-kill cases.
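A userspace sketch of marking a dedicated handler thread (error handling
kept minimal):

    #include <stdio.h>
    #include <sys/prctl.h>

    /* run this in the dedicated SIGBUS-handling thread */
    static void mark_mce_early_thread(void)
    {
            /* opt this thread in to early BUS_MCEERR_AO delivery */
            if (prctl(PR_MCE_KILL, PR_MCE_KILL_SET, PR_MCE_KILL_EARLY, 0, 0))
                    perror("prctl(PR_MCE_KILL)");
    }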
Tony said:
: The old behavior was crazy - someone with a multithreaded process might
: well expect that if they call prctl(PF_MCE_EARLY) in just one thread, then
: that thread would see the SIGBUS with si_code = BUS_MCEERR_AO - even if
: that thread wasn't the main thread for the process.
[[email protected]: coding-style fixes]
Signed-off-by: Naoya Horiguchi <[email protected]>
Reviewed-by: Tony Luck <[email protected]>
Cc: Kamil Iskra <[email protected]>
Cc: Andi Kleen <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Chen Gong <[email protected]>
Cc: <[email protected]> [3.2+]
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
MF_ACTION_REQUIRED
When Linux sees an "action optional" machine check (where h/w has reported
an error that is not in the current execution path) we generally do not
want to signal a process, since most processes do not have a SIGBUS
handler - we'd just prematurely terminate the process for a problem that
they might never actually see.
task_early_kill() decides whether to consider a process - and it checks
whether this specific process has been marked for early signals with
"prctl", or if the system administrator has requested early signals for
all processes using /proc/sys/vm/memory_failure_early_kill.
But for the MF_ACTION_REQUIRED case we must not defer. The error is in
the execution path of the current thread so we must send the SIGBUS
immediately.
Fix by passing a flag argument through collect_procs*() to
task_early_kill() so it knows whether we can defer or must take action.
Signed-off-by: Tony Luck <[email protected]>
Signed-off-by: Naoya Horiguchi <[email protected]>
Cc: Andi Kleen <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Chen Gong <[email protected]>
Cc: <[email protected]> [3.2+]
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
When a thread in a multi-threaded application hits a machine check because
of an uncorrectable error in memory - we want to send the SIGBUS with
si.si_code = BUS_MCEERR_AR to that thread. Currently we fail to do that
if the active thread is not the primary thread in the process.
collect_procs() just finds primary threads and this test:
if ((flags & MF_ACTION_REQUIRED) && t == current) {
will see that the thread we found isn't the current thread and so send a
si.si_code = BUS_MCEERR_AO to the primary (and nothing to the active
thread at this time).
We can fix this by checking whether "current" shares the same mm with the
process that collect_procs() said owned the page. If so, we send the
SIGBUS to current (with code BUS_MCEERR_AR).
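A sketch of the adjusted test in the kill_proc()-style path (signal-field
setup is illustrative):

    if ((flags & MF_ACTION_REQUIRED) && t->mm == current->mm) {
            /* the faulting thread itself: action required, signal it now */
            si.si_code = BUS_MCEERR_AR;
            ret = force_sig_info(SIGBUS, &si, current);
    } else {
            /* other threads get the deferred, action-optional signal */
            si.si_code = BUS_MCEERR_AO;
            ret = send_sig_info(SIGBUS, &si, t);
    }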
Signed-off-by: Tony Luck <[email protected]>
Signed-off-by: Naoya Horiguchi <[email protected]>
Reported-by: Otto Bruggeman <[email protected]>
Cc: Andi Kleen <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Chen Gong <[email protected]>
Cc: <[email protected]> [3.2+]
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
There is an orphaned prehistoric comment, which used to be against
get_dirty_limits(), the dawn of global_dirtyable_memory().
Back then, the implementation of get_dirty_limits() was complicated and
full of magic numbers, so this comment was necessary. But we now use the
clear and neat global_dirtyable_memory(), which renders this comment
ambiguous and useless. Remove it.
Signed-off-by: Jianyu Zhan <[email protected]>
Acked-by: Johannes Weiner <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
scan_swap_map()
Via commit ebc2a1a69111 ("swap: make cluster allocation per-cpu"), we
can see that all SWP_SOLIDSTATE "seek is cheap" (SSD) cases have already
gone to the si->cluster_info scan_swap_map_try_ssd_cluster() route, so
the "last_in_cluster < scan_base" loop in the body of scan_swap_map()
has become a dead code snippet and should have been deleted.
This patch deletes the redundant loop, as Hugh and Shaohua suggested.
[[email protected]: fix comment, simplify code]
Signed-off-by: Chen Yucong <[email protected]>
Cc: Shaohua Li <[email protected]>
Acked-by: Hugh Dickins <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
We already have a function named hugepages_supported(), and the similar
name hugepage_migration_support() is a bit uncomfortable, so let's rename
it hugepage_migration_supported().
Signed-off-by: Naoya Horiguchi <[email protected]>
Acked-by: Hugh Dickins <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Some clarification on how faultaround works.
[[email protected]: tweak comment text]
Signed-off-by: Kirill A. Shutemov <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
There is evidence that the faultaround feature is less relevant on
architectures with a page size bigger than 4k, which makes sense since the
page fault overhead per byte of mapped area should be lower there.
Let's rework the feature to specify the faultaround area in bytes instead
of page order. It's 64 kilobytes for now.
The patch effectively disables faultaround on architectures with a page
size >= 64k (like ppc64).
It's possible that some other size of faultaround area is relevant for a
platform. We can expose the `fault_around_bytes' variable to arch-specific
code once such platforms are found.
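A sketch of the byte-based knob (the rounding helper and the exact default
are illustrative):

    static unsigned long fault_around_bytes = 65536;

    static inline unsigned long fault_around_pages(void)
    {
            return rounddown_pow_of_two(fault_around_bytes) / PAGE_SIZE;
    }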
Signed-off-by: Kirill A. Shutemov <[email protected]>
Cc: Rusty Russell <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Madhavan Srinivasan <[email protected]>
Cc: Benjamin Herrenschmidt <[email protected]>
Cc: Paul Mackerras <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Andi Kleen <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Dave Hansen <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
add_active_range() has been replaced by memblock_set_node(). Clean up the
comments to comply with that change.
Signed-off-by: Zhang Zhen <[email protected]>
Cc: Tejun Heo <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Transform the action part of ttu_flags into individual bits. These flags
aren't part of any user-space visible api or even trace events.
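A sketch of the reshaped enum (flag names and values are illustrative of
splitting the action part into separate bits):

    enum ttu_flags {
            TTU_UNMAP = 1,                  /* unmap mode */
            TTU_MIGRATION = 2,              /* migration mode */
            TTU_MUNLOCK = 4,                /* munlock mode */

            TTU_IGNORE_MLOCK = (1 << 8),    /* ignore mlock */
            TTU_IGNORE_ACCESS = (1 << 9),   /* don't age */
            TTU_IGNORE_HWPOISON = (1 << 10),/* corrupted page is recoverable */
    };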
Signed-off-by: Konstantin Khlebnikov <[email protected]>
Cc: Rik van Riel <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
In its munlock mode, try_to_unmap_one() only searches other mlocked vmas;
it never unmaps pages. There is no reason for invalidation because the
ptes are left unchanged.
Signed-off-by: Konstantin Khlebnikov <[email protected]>
Cc: Rik van Riel <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
CONFIG_CROSS_MEMORY_ATTACH adds a couple of syscalls: process_vm_readv and
process_vm_writev; it's a kind of IPC for copying data between processes.
Currently this option is placed inside "Processor type and features".
This patch moves it into "General setup" (where all other arch-independent
syscalls and ipc features are placed) and changes the prompt string to
something less cryptic.
Signed-off-by: Konstantin Khlebnikov <[email protected]>
Cc: Christopher Yeoh <[email protected]>
Cc: Davidlohr Bueso <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Al Viro <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
DEF_PRIORITY
Commit "mm: vmscan: obey proportional scanning requirements for kswapd"
ensured that file/anon lists were scanned proportionally for reclaim from
kswapd but ignored it for direct reclaim. The intent was to minimse
direct reclaim latency but Yuanhan Liu pointer out that it substitutes one
long stall for many small stalls and distorts aging for normal workloads
like streaming readers/writers. Hugh Dickins pointed out that a
side-effect of the same commit was that when one LRU list dropped to zero
that the entirety of the other list was shrunk leading to excessive
reclaim in memcgs. This patch scans the file/anon lists proportionally
for direct reclaim to similarly age page whether reclaimed by kswapd or
direct reclaim but takes care to abort reclaim if one LRU drops to zero
after reclaiming the requested number of pages.
Based on ext4 and using the Intel VM scalability test
3.15.0-rc5 3.15.0-rc5
shrinker proportion
Unit lru-file-readonce elapsed 5.3500 ( 0.00%) 5.4200 ( -1.31%)
Unit lru-file-readonce time_range 0.2700 ( 0.00%) 0.1400 ( 48.15%)
Unit lru-file-readonce time_stddv 0.1148 ( 0.00%) 0.0536 ( 53.33%)
Unit lru-file-readtwice elapsed 8.1700 ( 0.00%) 8.1700 ( 0.00%)
Unit lru-file-readtwice time_range 0.4300 ( 0.00%) 0.2300 ( 46.51%)
Unit lru-file-readtwice time_stddv 0.1650 ( 0.00%) 0.0971 ( 41.16%)
The test cases run multiple dd instances reading sparse files. The results
are within the noise for the small test machine. The impact of the patch
is more noticeable from the vmstats:
3.15.0-rc5 3.15.0-rc5
shrinker proportion
Minor Faults 35154 36784
Major Faults 611 1305
Swap Ins 394 1651
Swap Outs 4394 5891
Allocation stalls 118616 44781
Direct pages scanned 4935171 4602313
Kswapd pages scanned 15921292 16258483
Kswapd pages reclaimed 15913301 16248305
Direct pages reclaimed 4933368 4601133
Kswapd efficiency 99% 99%
Kswapd velocity 670088.047 682555.961
Direct efficiency 99% 99%
Direct velocity 207709.217 193212.133
Percentage direct scans 23% 22%
Page writes by reclaim 4858.000 6232.000
Page writes file 464 341
Page writes anon 4394 5891
Note that there are fewer allocation stalls even though the amount
of direct reclaim scanning is very approximately the same.
Signed-off-by: Mel Gorman <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Tim Chen <[email protected]>
Cc: Dave Chinner <[email protected]>
Tested-by: Yuanhan Liu <[email protected]>
Cc: Bob Liu <[email protected]>
Cc: Jan Kara <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: Al Viro <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Remove the call to grab_super_passive() in super_cache_count(). This
becomes a scalability bottleneck as multiple threads try to do memory
reclamation, e.g. when we are doing a large amount of file reads and the
page cache is under pressure. The cached objects quickly get reclaimed
down to 0 and we are aborting the cache_scan() reclaim, but the counting
creates a log jam acquiring the sb_lock.
We are holding the shrinker_rwsem, which ensures the safety of the call to
list_lru_count_node() and s_op->nr_cached_objects. The shrinker is now
unregistered before ->kill_sb(), so the operation is safe when we are
doing unmount.
The impact will depend heavily on the machine and the workload but for a
small machine using postmark tuned to use 4xRAM size the results were
3.15.0-rc5 3.15.0-rc5
vanilla shrinker-v1r1
Ops/sec Transactions 21.00 ( 0.00%) 24.00 ( 14.29%)
Ops/sec FilesCreate 39.00 ( 0.00%) 44.00 ( 12.82%)
Ops/sec CreateTransact 10.00 ( 0.00%) 12.00 ( 20.00%)
Ops/sec FilesDeleted 6202.00 ( 0.00%) 6202.00 ( 0.00%)
Ops/sec DeleteTransact 11.00 ( 0.00%) 12.00 ( 9.09%)
Ops/sec DataRead/MB 25.97 ( 0.00%) 29.10 ( 12.05%)
Ops/sec DataWrite/MB 49.99 ( 0.00%) 56.02 ( 12.06%)
ffsb running in a configuration that is meant to simulate a mail server showed
3.15.0-rc5 3.15.0-rc5
vanilla shrinker-v1r1
Ops/sec readall 9402.63 ( 0.00%) 9567.97 ( 1.76%)
Ops/sec create 4695.45 ( 0.00%) 4735.00 ( 0.84%)
Ops/sec delete 173.72 ( 0.00%) 179.83 ( 3.52%)
Ops/sec Transactions 14271.80 ( 0.00%) 14482.81 ( 1.48%)
Ops/sec Read 37.00 ( 0.00%) 37.60 ( 1.62%)
Ops/sec Write 18.20 ( 0.00%) 18.30 ( 0.55%)
Signed-off-by: Tim Chen <[email protected]>
Signed-off-by: Mel Gorman <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Dave Chinner <[email protected]>
Tested-by: Yuanhan Liu <[email protected]>
Cc: Bob Liu <[email protected]>
Cc: Jan Kara <[email protected]>
Acked-by: Rik van Riel <[email protected]>
Cc: Al Viro <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
This series is aimed at regressions noticed during reclaim activity. The
first two patches are shrinker patches that were posted ages ago but never
merged for reasons that are unclear to me. I'm posting them again to see
if there was a reason they were dropped or if they just got lost. Dave?
Tim? The last patch adjusts proportional reclaim. Yuanhan Liu, can you
retest the vm scalability test cases on a larger machine? Hugh, does this
work for you on the memcg test cases?
Based on ext4, I get the following results but unfortunately my larger
test machines are all unavailable so this is based on a relatively small
machine.
postmark
3.15.0-rc5 3.15.0-rc5
vanilla proportion-v1r4
Ops/sec Transactions 21.00 ( 0.00%) 25.00 ( 19.05%)
Ops/sec FilesCreate 39.00 ( 0.00%) 45.00 ( 15.38%)
Ops/sec CreateTransact 10.00 ( 0.00%) 12.00 ( 20.00%)
Ops/sec FilesDeleted 6202.00 ( 0.00%) 6202.00 ( 0.00%)
Ops/sec DeleteTransact 11.00 ( 0.00%) 12.00 ( 9.09%)
Ops/sec DataRead/MB 25.97 ( 0.00%) 30.02 ( 15.59%)
Ops/sec DataWrite/MB 49.99 ( 0.00%) 57.78 ( 15.58%)
ffsb (mail server simulator)
3.15.0-rc5 3.15.0-rc5
vanilla proportion-v1r4
Ops/sec readall 9402.63 ( 0.00%) 9805.74 ( 4.29%)
Ops/sec create 4695.45 ( 0.00%) 4781.39 ( 1.83%)
Ops/sec delete 173.72 ( 0.00%) 177.23 ( 2.02%)
Ops/sec Transactions 14271.80 ( 0.00%) 14764.37 ( 3.45%)
Ops/sec Read 37.00 ( 0.00%) 38.50 ( 4.05%)
Ops/sec Write 18.20 ( 0.00%) 18.50 ( 1.65%)
dd of a large file
3.15.0-rc5 3.15.0-rc5
vanilla proportion-v1r4
WallTime DownloadTar 75.00 ( 0.00%) 61.00 ( 18.67%)
WallTime DD 423.00 ( 0.00%) 401.00 ( 5.20%)
WallTime Delete 2.00 ( 0.00%) 5.00 (-150.00%)
stutter (times mmap latency during large amounts of IO)
3.15.0-rc5 3.15.0-rc5
vanilla proportion-v1r4
Unit >5ms Delays 80252.0000 ( 0.00%) 81523.0000 ( -1.58%)
Unit Mmap min 8.2118 ( 0.00%) 8.3206 ( -1.33%)
Unit Mmap mean 17.4614 ( 0.00%) 17.2868 ( 1.00%)
Unit Mmap stddev 24.9059 ( 0.00%) 34.6771 (-39.23%)
Unit Mmap max 2811.6433 ( 0.00%) 2645.1398 ( 5.92%)
Unit Mmap 90% 20.5098 ( 0.00%) 18.3105 ( 10.72%)
Unit Mmap 93% 22.9180 ( 0.00%) 20.1751 ( 11.97%)
Unit Mmap 95% 25.2114 ( 0.00%) 22.4988 ( 10.76%)
Unit Mmap 99% 46.1430 ( 0.00%) 43.5952 ( 5.52%)
Unit Ideal Tput 85.2623 ( 0.00%) 78.8906 ( 7.47%)
Unit Tput min 44.0666 ( 0.00%) 43.9609 ( 0.24%)
Unit Tput mean 45.5646 ( 0.00%) 45.2009 ( 0.80%)
Unit Tput stddev 0.9318 ( 0.00%) 1.1084 (-18.95%)
Unit Tput max 46.7375 ( 0.00%) 46.7539 ( -0.04%)
This patch (of 3):
We would like to unregister the sb shrinker before ->kill_sb(). This will
allow cached objects to be counted without a call to grab_super_passive()
to update the ref count on the sb. We want to avoid locking during memory
reclamation, especially when we are skipping the memory reclaim because we
are out of cached objects.
This is safe because grab_super_passive() does a try-lock on
sb->s_umount now, and so if we are in the unmount process, it won't ever
block. That means what used to be a deadlock and the races we were
avoiding by using grab_super_passive() are now:
    shrinker                          umount

    down_read(shrinker_rwsem)
                                      down_write(sb->s_umount)
                                      shrinker_unregister
                                        down_write(shrinker_rwsem)
                                          <blocks>
    grab_super_passive(sb)
      down_read_trylock(sb->s_umount)
        <fails>
      <shrinker aborts>
    ....
    <shrinkers finish running>
    up_read(shrinker_rwsem)
                                          <unblocks>
                                          <removes shrinker>
                                        up_write(shrinker_rwsem)
                                      ->kill_sb()
                                      ....
->kill_sb()
....
So it is safe to deregister the shrinker before ->kill_sb().
Signed-off-by: Tim Chen <[email protected]>
Signed-off-by: Mel Gorman <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Dave Chinner <[email protected]>
Tested-by: Yuanhan Liu <[email protected]>
Cc: Bob Liu <[email protected]>
Cc: Jan Kara <[email protected]>
Acked-by: Rik van Riel <[email protected]>
Cc: Al Viro <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Signed-off-by: Kirill A. Shutemov <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
msync() currently syncs more than POSIX requires or BSD or Solaris
implement. It is supposed to be equivalent to fdatasync(), not fsync(),
and it is only supposed to sync the portion of the file that overlaps the
range passed to msync.
If the VMA is non-linear, fall back to syncing the entire file, but we
still optimise to only fdatasync() the entire file, not the full fsync().
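A sketch of the range-limited data sync (the offset arithmetic and the
non-linear fallback are illustrative; vfs_fsync_range()'s last argument
selects fdatasync semantics):

    fstart = (start - vma->vm_start) + ((loff_t)vma->vm_pgoff << PAGE_SHIFT);
    fend = fstart + (min(end, vma->vm_end) - start) - 1;

    if (vma->vm_flags & VM_NONLINEAR)
            error = vfs_fsync(file, 1);                     /* whole file, datasync */
    else
            error = vfs_fsync_range(file, fstart, fend, 1); /* only the msync'd range */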
akpm: there are obvious concerns with back-compatibility: is anyone relying
on the undocumented side-effect for their data integrity? And how would
they ever know if this change broke their data integrity?
We think the risk is reasonably low, and this patch brings the kernel into
line with other OS's and with what the manpage has always said...
Signed-off-by: Matthew Wilcox <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Acked-by: Jeff Moyer <[email protected]>
Cc: Chris Mason <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|