aboutsummaryrefslogtreecommitdiff
AgeCommit message (Collapse)AuthorFilesLines
2014-06-04printk: enable interrupts before calling console_trylock_for_printk()Jan Kara1-11/+18
We need interrupts disabled when calling console_trylock_for_printk() only so that cpu id we pass to can_use_console() remains valid (for other things console_sem provides all the exclusion we need and deadlocks on console_sem due to interrupts are impossible because we use down_trylock()). However if we are rescheduled, we are guaranteed to run on an online cpu so we can easily just get the cpu id in can_use_console(). We can lose a bit of performance when we enable interrupts in vprintk_emit() and then disable them again in console_unlock() but OTOH it can somewhat reduce interrupt latency caused by console_unlock() especially since later in the patch series we will want to spin on console_sem in console_trylock_for_printk(). Signed-off-by: Jan Kara <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2014-06-04printk: fix lockdep instrumentation of console_semJan Kara1-14/+32
Printk calls mutex_acquire() / mutex_release() by hand to instrument lockdep about console_sem. However in some corner cases the instrumentation is missing. Fix the problem by creating helper functions for locking / unlocking console_sem which take care of lockdep instrumentation as well. Signed-off-by: Jan Kara <[email protected]> Reported-by: Fabio Estevam <[email protected]> Reported-by: Andy Shevchenko <[email protected]> Tested-by: Fabio Estevam <[email protected]> Tested-By: Valdis Kletnieks <[email protected]> Cc: Steven Rostedt <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Ingo Molnar <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2014-06-04printk: release lockbuf_lock before calling console_trylock_for_printk()Jan Kara1-33/+21
There's no reason to hold lockbuf_lock when entering console_trylock_for_printk(). The first thing this function does is to call down_trylock(console_sem) and if that fails it immediately unlocks lockbuf_lock. So lockbuf_lock isn't needed for that branch. When down_trylock() succeeds, the rest of console_trylock() is OK without lockbuf_lock (it is called without it from other places), and the only remaining thing in console_trylock_for_printk() is can_use_console() call. For that call console_sem is enough (it iterates all consoles and checks CON_ANYTIME flag). So we drop logbuf_lock before entering console_trylock_for_printk() which simplifies the code. [[email protected]: fix have_callable_console() comment] Signed-off-by: Jan Kara <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2014-06-04printk: remove outdated commentJan Kara1-2/+1
Comment about interesting interlocking between lockbuf_lock and console_sem is outdated. It was added in 2002 by commit a880f45a48be during conversion of console_lock to console_sem + lockbuf_lock. At that time release_console_sem() (today's equivalent is console_unlock()) was indeed using lockbuf_lock to avoid races between trylock on console_sem in printk() and unlock of console_sem. However these days the interlocking is gone and the races are avoided by rechecking logbuf state after releasing console_sem. Signed-off-by: Jan Kara <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2014-06-04printk: return really stored message lengthPetr Mladek1-15/+21
I wonder if anyone uses printk return value but it is there and should be counted correctly. This patch modifies log_store() to return the number of really stored bytes from the 'text' part. Also it handles the return value in vprintk_emit(). Note that log_store() is used also in cont_flush() but we could ignore the return value there. The function works with characters that were already counted earlier. In addition, the store could newer fail here because the length of the printed text is limited by the "cont" buffer and "dict" is NULL. Signed-off-by: Petr Mladek <[email protected]> Cc: Jan Kara <[email protected]> Cc: Jiri Kosina <[email protected]> Cc: Kay Sievers <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2014-06-04printk: shrink too long messagesPetr Mladek1-3/+39
We might want to print at least part of too long messages and add some warning for debugging purpose. The question is how long the shrunken message should be. If we use the whole buffer, it might get rotated too soon. Let's try to use only 1/4 of the buffer for now. Also shrink the whole dictionary. We do not want to parse it or break it in the middle of some pair of values. It would not cause any real harm but still. Signed-off-by: Petr Mladek <[email protected]> Cc: Jan Kara <[email protected]> Cc: Jiri Kosina <[email protected]> Cc: Kay Sievers <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2014-06-04printk: split message size computationPetr Mladek1-3/+13
We will want to recompute the message size when shrinking too long messages. Let's put the code into separate function. The side effect of setting "pad_len" is not nice but it is worth removing the code duplication. Note that I will probably have one more usage for this function when handling messages safe way in NMI context. This patch does not change the existing behavior. Signed-off-by: Petr Mladek <[email protected]> Cc: Jan Kara <[email protected]> Cc: Jiri Kosina <[email protected]> Cc: Kay Sievers <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2014-06-04printk: ignore too long messagesPetr Mladek1-7/+23
There was no check for too long messages. The check for free space always passed when first_seq and next_seq were equal. Enough free space was not guaranteed, though. log_store() might be called to store messages up to 64kB + 64kB + 16B. This is sum of maximal text_len, dict_len values, and the size of the structure printk_log. On the other hand, the minimal size for the main log buffer currently is 4kB and it is enforced only by Kconfig. The good news is that the usage looks safe right now. log_store() is called only from vprintk_emit() and cont_flush(). Here the "text" part is always passed via a static buffer and the length is limited to LOG_LINE_MAX which is 1024. The "dict" part is NULL in most cases. The only exceptions is when vprintk_emit() is called from printk_emit() and dev_vprintk_emit(). But printk_emit() is currently used only in devkmsg_writev() and here "dict" is NULL as well. In dev_vprintk_emit(), "dict" is limited by the static buffer "hdr" of the size 128 bytes. It meas that the current maximal printed text is 1024B + 128B + 16B and it always fit the log buffer. But it is only matter of time when someone calls printk_emit() with unsafe parameters, especially the "dict" one. This patch adds a check for the free space when the buffer is empty. It reuses the already existing log_has_space() function but it has to add an extra parameter. It defines whether the buffer is empty. Note that the same values of "first_idx" and "next_idx" might also mean that the buffer is full. If the buffer is empty, we must respect the current position of the indexes. We cannot reset them to the beginning of the buffer. Otherwise, the functions reading the buffer would get crazy. The question is what to do when the message is too long. This patch uses the easiest solution and just ignores the problematic message. Let's do something better in a followup patch. Signed-off-by: Petr Mladek <[email protected]> Cc: Jan Kara <[email protected]> Cc: Jiri Kosina <[email protected]> Cc: Kay Sievers <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2014-06-04printk: split code for making free space in the log bufferPetr Mladek1-15/+29
The check for free space in the log buffer always passes when "first_seq" and "next_seq" are equal. In theory, it might cause writing outside of the log buffer. Fortunately, the current usage looks safe because the used "text" and "dict" buffers are quite limited. See the second patch for more details. Anyway, it is better to be on the safe side and add a check. An easy solution is done in the 2nd patch and it is improved in the 4th patch. 5th patch fixes the computation of the printed message length. 1st and 3rd patches just do some code refactoring to make the other patches easier. This patch (of 5): There will be needed some fixes in the check for free space. They will be easier if the code is moved outside of the quite long log_store() function. This patch does not change the existing behavior. Signed-off-by: Petr Mladek <[email protected]> Cc: Jan Kara <[email protected]> Cc: Jiri Kosina <[email protected]> Cc: Kay Sievers <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2014-06-04kernel/user.c: drop unused field 'files' from user_structKirill A. Shutemov2-2/+0
Nobody seems uses it for a long time. Let's drop it. Signed-off-by: Kirill A. Shutemov <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2014-06-04kernel/hung_task.c: convert simple_strtoul to kstrtouintFabian Frederick1-1/+3
sysctl_hung_task_panic has been changed to unsigned int. use kstrtouint instead of obsolete simple_strtoul Signed-off-by: Fabian Frederick <[email protected]> Cc: Ingo Molnar <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2014-06-04kernel/utsname_sysctl.c: replace obsolete __initcall by device_initcallFabian Frederick1-2/+2
Also fixes checkpatch warnings on proc_dostring function parameters Signed-off-by: Fabian Frederick <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2014-06-04kernel/reboot.c: convert simple_strtoul to kstrtointFabian Frederick1-7/+14
Replace obsolete function. kstrtoint is used as reboot_cpu is an integer. Signed-off-by: Fabian Frederick <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2014-06-04kernel/res_counter.c: replace simple_strtoull by kstrtoullFabian Frederick1-2/+5
[[email protected]: don't overwrite kstrtoull()'s errno] Signed-off-by: Fabian Frederick <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Tejun Heo <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2014-06-04kernel/tracepoint.c: kernel-doc fixesFabian Frederick1-0/+2
Signed-off-by: Fabian Frederick <[email protected]> Cc: Steven Rostedt <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2014-06-04kernel/stop_machine.c: kernel-doc warning fixFabian Frederick1-0/+1
Signed-off-by: Fabian Frederick <[email protected]> Cc: Peter Zijlstra <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2014-06-04kernel/latencytop.c: convert seq_printf to seq_putsFabian Frederick1-2/+3
This patch also fixes one function declaration over 80 characters. Signed-off-by: Fabian Frederick <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2014-06-04kernel/exec_domain.c: code clean-upFabian Frederick1-8/+6
Fix checkpatch warnings about EXPORT_SYMBOL and return() Signed-off-by: Fabian Frederick <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2014-06-04kernel/capability.c: code clean-upFabian Frederick1-3/+3
- EXPORT_SYMBOL - typo: unexpectidly->unexpectedly - function prototype over 80 characters Signed-off-by: Fabian Frederick <[email protected]> Cc: Serge Hallyn <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2014-06-04kernel/backtracetest.c: replace no level printk by pr_info()Fabian Frederick1-9/+9
Signed-off-by: Fabian Frederick <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2014-06-04kernel/cpu.c: convert printk to pr_foo()Fabian Frederick1-17/+14
no level printk converted to pr_warn (if err) no level printk converted to pr_info (disabling non-boot cpus) Other printk converted to respective level. Signed-off-by: Fabian Frederick <[email protected]> Cc: "Rafael J. Wysocki" <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Ingo Molnar <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2014-06-04compiler.h: avoid sparse errors in __compiletime_error_fallback()James Hogan1-2/+11
Usually, BUG_ON and friends aren't even evaluated in sparse, but recently compiletime_assert_atomic_type() was added, and that now results in a sparse warning every time it is used. The reason turns out to be the temporary variable, after it sparse no longer considers the value to be a constant, and results in a warning and an error. The error is the more annoying part of this as it suppresses any further warnings in the same file, hiding other problems. Unfortunately the condition cannot be simply expanded out to avoid the temporary variable since it breaks compiletime_assert on old versions of GCC such as GCC 4.2.4 which the latest metag compiler is based on. Therefore #ifndef __CHECKER__ out the __compiletime_error_fallback which uses the potentially negative size array to trigger a conditional compiler error, so that sparse doesn't see it. Signed-off-by: James Hogan <[email protected]> Cc: Johannes Berg <[email protected]> Cc: Daniel Santos <[email protected]> Cc: Luciano Coelho <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Paul E. McKenney <[email protected]> Acked-by: Johannes Berg <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2014-06-04fs/exportfs/expfs.c: kernel-doc warning fixesFabian Frederick1-2/+2
Fixing 2 typo in function comments. Signed-off-by: Fabian Frederick <[email protected]> Cc: Al Viro <[email protected]> Cc: "J. Bruce Fields" <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2014-06-04fs/efivarfs/super.c: use static const for dentry_operationsFabian Frederick1-1/+1
...like other filesystems. Signed-off-by: Fabian Frederick <[email protected]> Cc: Matthew Garrett <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2014-06-04sys_sgetmask/sys_ssetmask: add CONFIG_SGETMASK_SYSCALLFabian Frederick15-14/+14
sys_sgetmask and sys_ssetmask are obsolete system calls no longer supported in libc. This patch replaces architecture related __ARCH_WANT_SYS_SGETMAX by expert mode configuration.That option is enabled by default for those architectures. Signed-off-by: Fabian Frederick <[email protected]> Cc: Steven Miao <[email protected]> Cc: Mikael Starvik <[email protected]> Cc: Jesper Nilsson <[email protected]> Cc: David Howells <[email protected]> Cc: Geert Uytterhoeven <[email protected]> Cc: Michal Simek <[email protected]> Cc: Ralf Baechle <[email protected]> Cc: Koichi Yasutake <[email protected]> Cc: "James E.J. Bottomley" <[email protected]> Cc: Helge Deller <[email protected]> Cc: Benjamin Herrenschmidt <[email protected]> Cc: Paul Mackerras <[email protected]> Cc: "David S. Miller" <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: "H. Peter Anvin" <[email protected]> Cc: Greg Ungerer <[email protected]> Cc: Heiko Carstens <[email protected]> Cc: Oleg Nesterov <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2014-06-04mm/zswap: NUMA aware allocation for zswap_dstmemEric Dumazet1-1/+1
zswap_dstmem is a percpu block of memory, which should be allocated using kmalloc_node(), to get better NUMA locality. Without it, all the blocks are allocated from a single node. Signed-off-by: Eric Dumazet <[email protected]> Acked-by: Seth Jennings <[email protected]> Acked-by: David Rientjes <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2014-06-04mm/zsmalloc: make zsmalloc module-buildableMinchan Kim1-1/+1
Now, we can build zsmalloc as module because unmap_kernel_range was exported. Signed-off-by: Minchan Kim <[email protected]> Cc: Nitin Gupta <[email protected]> Cc: Sergey Senozhatsky <[email protected]> Cc: Jerome Marchand <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2014-06-04mm/vmalloc.c: export unmap_kernel_range()Minchan Kim1-0/+1
zsmalloc needs exported unmap_kernel_range for building as a module. See https://lkml.org/lkml/2013/1/18/487 I didn't send a patch to make unmap_kernel_range exportable at that time because zram was staging stuff and I thought VM function exporting for staging stuff makes no sense. Now zsmalloc was promoted. If we can't build zsmalloc as module, it means we can't build zram as module, either. Additionally, buddy map_vm_area is already exported so let's export unmap_kernel_range to help his buddy. Signed-off-by: Minchan Kim <[email protected]> Cc: Nitin Gupta <[email protected]> Cc: Sergey Senozhatsky <[email protected]> Cc: Jerome Marchand <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2014-06-04zsmalloc: fixup trivial zs size classes value in commentsWeijie Yang1-1/+1
According to calculation, ZS_SIZE_CLASSES value is 255 on systems with 4K page size, not 254. The old value may forget count the ZS_MIN_ALLOC_SIZE in. This patch fixes this trivial issue in the comments. Signed-off-by: Weijie Yang <[email protected]> Cc: Minchan Kim <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2014-06-04mm/zbud.c: make size unsigned like unique callsiteFabian Frederick2-3/+3
zbud_alloc is only called by zswap_frontswap_store with unsigned int len. Change function parameter + update >= 0 check. Signed-off-by: Fabian Frederick <[email protected]> Acked-by: Seth Jennings <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2014-06-04zram: correct offset usage in zram_bio_discardWeijie Yang1-2/+2
We want to skip the physical block(PAGE_SIZE) which is partially covered by the discard bio, so we check the remaining size and subtract it if there is a need to goto the next physical block. The current offset usage in zram_bio_discard is incorrect, it will cause its upper filesystem breakdown. Consider the following scenario: On some architecture or config, PAGE_SIZE is 64K for example, filesystem is set up on zram disk without PAGE_SIZE aligned, a discard bio leads to a offset = 4K and size=72K, normally, it should not really discard any physical block as it partially cover two physical blocks. However, with the current offset usage, it will discard the second physical block and free its memory, which will cause filesystem breakdown. This patch corrects the offset usage in zram_bio_discard. Signed-off-by: Weijie Yang <[email protected]> Cc: Minchan Kim <[email protected]> Cc: Nitin Gupta <[email protected]> Acked-by: Joonsoo Kim <[email protected]> Cc: Sergey Senozhatsky <[email protected]> Cc: Bob Liu <[email protected]> Cc: <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2014-06-04mm, memcg: periodically schedule when emptying page listHugh Dickins1-1/+1
mem_cgroup_force_empty_list() can iterate a large number of pages on an lru and mem_cgroup_move_parent() doesn't return an errno unless certain criteria, none of which indicate that the iteration may be taking too long, is met. We have encountered the following stack trace many times indicating "need_resched set for > 51000020 ns (51 ticks) without schedule", for example: scheduler_tick() <timer irq> mem_cgroup_move_account+0x4d/0x1d5 mem_cgroup_move_parent+0x8d/0x109 mem_cgroup_reparent_charges+0x149/0x2ba mem_cgroup_css_offline+0xeb/0x11b cgroup_offline_fn+0x68/0x16b process_one_work+0x129/0x350 If this iteration is taking too long, we still need to do cond_resched() even when an individual page is not busy. [[email protected]: changelog] Signed-off-by: Hugh Dickins <[email protected]> Signed-off-by: David Rientjes <[email protected]> Acked-by: Johannes Weiner <[email protected]> Acked-by: Michal Hocko <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2014-06-04Documentation/sysctl/vm.txt: clarify vfs_cache_pressure descriptionDenys Vlasenko1-2/+7
Existing description is worded in a way which almost encourages setting of vfs_cache_pressure above 100, possibly way above it. Users are left in a dark what this numeric value is - an int? a percentage? what the scale is? As a result, we are getting reports about noticeable performance degradation from users who have set vfs_cache_pressure to ridiculously high values - because they thought there is no downside to it. Via code inspection it's obvious that this value is treated as a percentage. This patch changes text to reflect this fact, and adds a cautionary paragraph advising against setting vfs_cache_pressure sky high. Signed-off-by: Denys Vlasenko <[email protected]> Cc: Alexander Viro <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2014-06-04mm/memory-failure.c: support use of a dedicated thread to handle ↵Naoya Horiguchi2-13/+48
SIGBUS(BUS_MCEERR_AO) Currently memory error handler handles action optional errors in the deferred manner by default. And if a recovery aware application wants to handle it immediately, it can do it by setting PF_MCE_EARLY flag. However, such signal can be sent only to the main thread, so it's problematic if the application wants to have a dedicated thread to handler such signals. So this patch adds dedicated thread support to memory error handler. We have PF_MCE_EARLY flags for each thread separately, so with this patch AO signal is sent to the thread with PF_MCE_EARLY flag set, not the main thread. If you want to implement a dedicated thread, you call prctl() to set PF_MCE_EARLY on the thread. Memory error handler collects processes to be killed, so this patch lets it check PF_MCE_EARLY flag on each thread in the collecting routines. No behavioral change for all non-early kill cases. Tony said: : The old behavior was crazy - someone with a multithreaded process might : well expect that if they call prctl(PF_MCE_EARLY) in just one thread, then : that thread would see the SIGBUS with si_code = BUS_MCEERR_A0 - even if : that thread wasn't the main thread for the process. [[email protected]: coding-style fixes] Signed-off-by: Naoya Horiguchi <[email protected]> Reviewed-by: Tony Luck <[email protected]> Cc: Kamil Iskra <[email protected]> Cc: Andi Kleen <[email protected]> Cc: Borislav Petkov <[email protected]> Cc: Chen Gong <[email protected]> Cc: <[email protected]> [3.2+] Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2014-06-04mm/memory-failure.c: don't let collect_procs() skip over processes for ↵Tony Luck1-9/+12
MF_ACTION_REQUIRED When Linux sees an "action optional" machine check (where h/w has reported an error that is not in the current execution path) we generally do not want to signal a process, since most processes do not have a SIGBUS handler - we'd just prematurely terminate the process for a problem that they might never actually see. task_early_kill() decides whether to consider a process - and it checks whether this specific process has been marked for early signals with "prctl", or if the system administrator has requested early signals for all processes using /proc/sys/vm/memory_failure_early_kill. But for MF_ACTION_REQUIRED case we must not defer. The error is in the execution path of the current thread so we must send the SIGBUS immediatley. Fix by passing a flag argument through collect_procs*() to task_early_kill() so it knows whether we can defer or must take action. Signed-off-by: Tony Luck <[email protected]> Signed-off-by: Naoya Horiguchi <[email protected]> Cc: Andi Kleen <[email protected]> Cc: Borislav Petkov <[email protected]> Cc: Chen Gong <[email protected]> Cc: <[email protected]> [3.2+] Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2014-06-04mm/memory-failure.c-failure: send right signal code to correct threadTony Luck1-2/+2
When a thread in a multi-threaded application hits a machine check because of an uncorrectable error in memory - we want to send the SIGBUS with si.si_code = BUS_MCEERR_AR to that thread. Currently we fail to do that if the active thread is not the primary thread in the process. collect_procs() just finds primary threads and this test: if ((flags & MF_ACTION_REQUIRED) && t == current) { will see that the thread we found isn't the current thread and so send a si.si_code = BUS_MCEERR_AO to the primary (and nothing to the active thread at this time). We can fix this by checking whether "current" shares the same mm with the process that collect_procs() said owned the page. If so, we send the SIGBUS to current (with code BUS_MCEERR_AR). Signed-off-by: Tony Luck <[email protected]> Signed-off-by: Naoya Horiguchi <[email protected]> Reported-by: Otto Bruggeman <[email protected]> Cc: Andi Kleen <[email protected]> Cc: Borislav Petkov <[email protected]> Cc: Chen Gong <[email protected]> Cc: <[email protected]> [3.2+] Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2014-06-04mm/page-writeback.c: remove outdated commentJianyu Zhan1-18/+0
There is an orphaned prehistoric comment , which used to be against get_dirty_limits(), the dawn of global_dirtyable_memory(). Back then, the implementation of get_dirty_limits() is complicated and full of magic numbers, so this comment is necessary. But we now use the clear and neat global_dirtyable_memory(), which renders this comment ambiguous and useless. Remove it. Signed-off-by: Jianyu Zhan <[email protected]> Acked-by: Johannes Weiner <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2014-06-04mm/swapfile.c: delete the "last_in_cluster < scan_base" loop in the body of ↵Chen Yucong1-26/+3
scan_swap_map() Via commit ebc2a1a69111 ("swap: make cluster allocation per-cpu"), we can find that all SWP_SOLIDSTATE "seek is cheap"(SSD case) has already gone to si->cluster_info scan_swap_map_try_ssd_cluster() route. So that the "last_in_cluster < scan_base" loop in the body of scan_swap_map() has already become a dead code snippet, and it should have been deleted. This patch is to delete the redundant loop as Hugh and Shaohua suggested. [[email protected]: fix comment, simplify code] Signed-off-by: Chen Yucong <[email protected]> Cc: Shaohua Li <[email protected]> Acked-by: Hugh Dickins <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2014-06-04hugetlb: rename hugepage_migration_support() to ..._supported()Naoya Horiguchi3-4/+4
We already have a function named hugepages_supported(), and the similar name hugepage_migration_support() is a bit unconfortable, so let's rename it hugepage_migration_supported(). Signed-off-by: Naoya Horiguchi <[email protected]> Acked-by: Hugh Dickins <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2014-06-04mm: document do_fault_around() featureKirill A. Shutemov1-0/+27
Some clarification on how faultaround works. [[email protected]: tweak comment text] Signed-off-by: Kirill A. Shutemov <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2014-06-04mm: nominate faultaround area in bytes rather than page orderKirill A. Shutemov1-39/+23
There is evidencs that the faultaround feature is less relevant on architectures with page size bigger then 4k. Which makes sense since page fault overhead per byte of mapped area should be less there. Let's rework the feature to specify faultaround area in bytes instead of page order. It's 64 kilobytes for now. The patch effectively disables faultaround on architectures with page size >= 64k (like ppc64). It's possible that some other size of faultaround area is relevant for a platform. We can expose `fault_around_bytes' variable to arch-specific code once such platforms will be found. Signed-off-by: Kirill A. Shutemov <[email protected]> Cc: Rusty Russell <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Madhavan Srinivasan <[email protected]> Cc: Benjamin Herrenschmidt <[email protected]> Cc: Paul Mackerras <[email protected]> Cc: Rik van Riel <[email protected]> Cc: Mel Gorman <[email protected]> Cc: Andi Kleen <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Dave Hansen <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2014-06-04mm/page_alloc.c: cleanup add_active_range() related commentsZhang Zhen1-13/+8
add_active_range() has been repalced by memblock_set_node(). Clean up the comments to comply with that change. Signed-off-by: Zhang Zhen <[email protected]> Cc: Tejun Heo <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2014-06-04mm/rmap.c: cleanup ttu_flagsKonstantin Khlebnikov2-9/+8
Transform action part of ttu_flags into individiual bits. These flags aren't part of any uses-space visible api or even trace events. Signed-off-by: Konstantin Khlebnikov <[email protected]> Cc: Rik van Riel <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2014-06-04mm/rmap.c: don't call mmu_notifier_invalidate_page() during munlockKonstantin Khlebnikov1-1/+1
In its munmap mode, try_to_unmap_one() searches other mlocked vmas, it never unmaps pages. There is no reason for invalidation because ptes are left unchanged. Signed-off-by: Konstantin Khlebnikov <[email protected]> Cc: Rik van Riel <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2014-06-04mm/process_vm_access: move config option into init/KconfigKonstantin Khlebnikov2-10/+10
CONFIG_CROSS_MEMORY_ATTACH adds couple syscalls: process_vm_readv and process_vm_writev, it's a kind of IPC for copying data between processes. Currently this option is placed inside "Processor type and features". This patch moves it into "General setup" (where all other arch-independed syscalls and ipc features are placed) and changes prompt string to less cryptic. Signed-off-by: Konstantin Khlebnikov <[email protected]> Cc: Christopher Yeoh <[email protected]> Cc: Davidlohr Bueso <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Al Viro <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2014-06-04mm: vmscan: use proportional scanning during direct reclaim and full scan at ↵Mel Gorman1-11/+25
DEF_PRIORITY Commit "mm: vmscan: obey proportional scanning requirements for kswapd" ensured that file/anon lists were scanned proportionally for reclaim from kswapd but ignored it for direct reclaim. The intent was to minimse direct reclaim latency but Yuanhan Liu pointer out that it substitutes one long stall for many small stalls and distorts aging for normal workloads like streaming readers/writers. Hugh Dickins pointed out that a side-effect of the same commit was that when one LRU list dropped to zero that the entirety of the other list was shrunk leading to excessive reclaim in memcgs. This patch scans the file/anon lists proportionally for direct reclaim to similarly age page whether reclaimed by kswapd or direct reclaim but takes care to abort reclaim if one LRU drops to zero after reclaiming the requested number of pages. Based on ext4 and using the Intel VM scalability test 3.15.0-rc5 3.15.0-rc5 shrinker proportion Unit lru-file-readonce elapsed 5.3500 ( 0.00%) 5.4200 ( -1.31%) Unit lru-file-readonce time_range 0.2700 ( 0.00%) 0.1400 ( 48.15%) Unit lru-file-readonce time_stddv 0.1148 ( 0.00%) 0.0536 ( 53.33%) Unit lru-file-readtwice elapsed 8.1700 ( 0.00%) 8.1700 ( 0.00%) Unit lru-file-readtwice time_range 0.4300 ( 0.00%) 0.2300 ( 46.51%) Unit lru-file-readtwice time_stddv 0.1650 ( 0.00%) 0.0971 ( 41.16%) The test cases are running multiple dd instances reading sparse files. The results are within the noise for the small test machine. The impact of the patch is more noticable from the vmstats 3.15.0-rc5 3.15.0-rc5 shrinker proportion Minor Faults 35154 36784 Major Faults 611 1305 Swap Ins 394 1651 Swap Outs 4394 5891 Allocation stalls 118616 44781 Direct pages scanned 4935171 4602313 Kswapd pages scanned 15921292 16258483 Kswapd pages reclaimed 15913301 16248305 Direct pages reclaimed 4933368 4601133 Kswapd efficiency 99% 99% Kswapd velocity 670088.047 682555.961 Direct efficiency 99% 99% Direct velocity 207709.217 193212.133 Percentage direct scans 23% 22% Page writes by reclaim 4858.000 6232.000 Page writes file 464 341 Page writes anon 4394 5891 Note that there are fewer allocation stalls even though the amount of direct reclaim scanning is very approximately the same. Signed-off-by: Mel Gorman <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Tim Chen <[email protected]> Cc: Dave Chinner <[email protected]> Tested-by: Yuanhan Liu <[email protected]> Cc: Bob Liu <[email protected]> Cc: Jan Kara <[email protected]> Cc: Rik van Riel <[email protected]> Cc: Al Viro <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2014-06-04fs/superblock: avoid locking counting inodes and dentries before reclaiming themTim Chen1-4/+8
We remove the call to grab_super_passive in call to super_cache_count. This becomes a scalability bottleneck as multiple threads are trying to do memory reclamation, e.g. when we are doing large amount of file read and page cache is under pressure. The cached objects quickly got reclaimed down to 0 and we are aborting the cache_scan() reclaim. But counting creates a log jam acquiring the sb_lock. We are holding the shrinker_rwsem which ensures the safety of call to list_lru_count_node() and s_op->nr_cached_objects. The shrinker is unregistered now before ->kill_sb() so the operation is safe when we are doing unmount. The impact will depend heavily on the machine and the workload but for a small machine using postmark tuned to use 4xRAM size the results were 3.15.0-rc5 3.15.0-rc5 vanilla shrinker-v1r1 Ops/sec Transactions 21.00 ( 0.00%) 24.00 ( 14.29%) Ops/sec FilesCreate 39.00 ( 0.00%) 44.00 ( 12.82%) Ops/sec CreateTransact 10.00 ( 0.00%) 12.00 ( 20.00%) Ops/sec FilesDeleted 6202.00 ( 0.00%) 6202.00 ( 0.00%) Ops/sec DeleteTransact 11.00 ( 0.00%) 12.00 ( 9.09%) Ops/sec DataRead/MB 25.97 ( 0.00%) 29.10 ( 12.05%) Ops/sec DataWrite/MB 49.99 ( 0.00%) 56.02 ( 12.06%) ffsb running in a configuration that is meant to simulate a mail server showed 3.15.0-rc5 3.15.0-rc5 vanilla shrinker-v1r1 Ops/sec readall 9402.63 ( 0.00%) 9567.97 ( 1.76%) Ops/sec create 4695.45 ( 0.00%) 4735.00 ( 0.84%) Ops/sec delete 173.72 ( 0.00%) 179.83 ( 3.52%) Ops/sec Transactions 14271.80 ( 0.00%) 14482.81 ( 1.48%) Ops/sec Read 37.00 ( 0.00%) 37.60 ( 1.62%) Ops/sec Write 18.20 ( 0.00%) 18.30 ( 0.55%) Signed-off-by: Tim Chen <[email protected]> Signed-off-by: Mel Gorman <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Dave Chinner <[email protected]> Tested-by: Yuanhan Liu <[email protected]> Cc: Bob Liu <[email protected]> Cc: Jan Kara <[email protected]> Acked-by: Rik van Riel <[email protected]> Cc: Al Viro <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2014-06-04fs/superblock: unregister sb shrinker before ->kill_sb()Dave Chinner1-3/+1
This series is aimed at regressions noticed during reclaim activity. The first two patches are shrinker patches that were posted ages ago but never merged for reasons that are unclear to me. I'm posting them again to see if there was a reason they were dropped or if they just got lost. Dave? Time? The last patch adjusts proportional reclaim. Yuanhan Liu, can you retest the vm scalability test cases on a larger machine? Hugh, does this work for you on the memcg test cases? Based on ext4, I get the following results but unfortunately my larger test machines are all unavailable so this is based on a relatively small machine. postmark 3.15.0-rc5 3.15.0-rc5 vanilla proportion-v1r4 Ops/sec Transactions 21.00 ( 0.00%) 25.00 ( 19.05%) Ops/sec FilesCreate 39.00 ( 0.00%) 45.00 ( 15.38%) Ops/sec CreateTransact 10.00 ( 0.00%) 12.00 ( 20.00%) Ops/sec FilesDeleted 6202.00 ( 0.00%) 6202.00 ( 0.00%) Ops/sec DeleteTransact 11.00 ( 0.00%) 12.00 ( 9.09%) Ops/sec DataRead/MB 25.97 ( 0.00%) 30.02 ( 15.59%) Ops/sec DataWrite/MB 49.99 ( 0.00%) 57.78 ( 15.58%) ffsb (mail server simulator) 3.15.0-rc5 3.15.0-rc5 vanilla proportion-v1r4 Ops/sec readall 9402.63 ( 0.00%) 9805.74 ( 4.29%) Ops/sec create 4695.45 ( 0.00%) 4781.39 ( 1.83%) Ops/sec delete 173.72 ( 0.00%) 177.23 ( 2.02%) Ops/sec Transactions 14271.80 ( 0.00%) 14764.37 ( 3.45%) Ops/sec Read 37.00 ( 0.00%) 38.50 ( 4.05%) Ops/sec Write 18.20 ( 0.00%) 18.50 ( 1.65%) dd of a large file 3.15.0-rc5 3.15.0-rc5 vanilla proportion-v1r4 WallTime DownloadTar 75.00 ( 0.00%) 61.00 ( 18.67%) WallTime DD 423.00 ( 0.00%) 401.00 ( 5.20%) WallTime Delete 2.00 ( 0.00%) 5.00 (-150.00%) stutter (times mmap latency during large amounts of IO) 3.15.0-rc5 3.15.0-rc5 vanilla proportion-v1r4 Unit >5ms Delays 80252.0000 ( 0.00%) 81523.0000 ( -1.58%) Unit Mmap min 8.2118 ( 0.00%) 8.3206 ( -1.33%) Unit Mmap mean 17.4614 ( 0.00%) 17.2868 ( 1.00%) Unit Mmap stddev 24.9059 ( 0.00%) 34.6771 (-39.23%) Unit Mmap max 2811.6433 ( 0.00%) 2645.1398 ( 5.92%) Unit Mmap 90% 20.5098 ( 0.00%) 18.3105 ( 10.72%) Unit Mmap 93% 22.9180 ( 0.00%) 20.1751 ( 11.97%) Unit Mmap 95% 25.2114 ( 0.00%) 22.4988 ( 10.76%) Unit Mmap 99% 46.1430 ( 0.00%) 43.5952 ( 5.52%) Unit Ideal Tput 85.2623 ( 0.00%) 78.8906 ( 7.47%) Unit Tput min 44.0666 ( 0.00%) 43.9609 ( 0.24%) Unit Tput mean 45.5646 ( 0.00%) 45.2009 ( 0.80%) Unit Tput stddev 0.9318 ( 0.00%) 1.1084 (-18.95%) Unit Tput max 46.7375 ( 0.00%) 46.7539 ( -0.04%) This patch (of 3): We will like to unregister the sb shrinker before ->kill_sb(). This will allow cached objects to be counted without call to grab_super_passive() to update ref count on sb. We want to avoid locking during memory reclamation especially when we are skipping the memory reclaim when we are out of cached objects. This is safe because grab_super_passive does a try-lock on the sb->s_umount now, and so if we are in the unmount process, it won't ever block. That means what used to be a deadlock and races we were avoiding by using grab_super_passive() is now: shrinker umount down_read(shrinker_rwsem) down_write(sb->s_umount) shrinker_unregister down_write(shrinker_rwsem) <blocks> grab_super_passive(sb) down_read_trylock(sb->s_umount) <fails> <shrinker aborts> .... <shrinkers finish running> up_read(shrinker_rwsem) <unblocks> <removes shrinker> up_write(shrinker_rwsem) ->kill_sb() .... So it is safe to deregister the shrinker before ->kill_sb(). Signed-off-by: Tim Chen <[email protected]> Signed-off-by: Mel Gorman <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Dave Chinner <[email protected]> Tested-by: Yuanhan Liu <[email protected]> Cc: Bob Liu <[email protected]> Cc: Jan Kara <[email protected]> Acked-by: Rik van Riel <[email protected]> Cc: Al Viro <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2014-06-04mm: fix typo in comment in do_fault_around()Kirill A. Shutemov1-1/+1
Signed-off-by: Kirill A. Shutemov <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2014-06-04mm/msync.c: sync only the requested range in msync()Matthew Wilcox1-1/+7
msync() currently syncs more than POSIX requires or BSD or Solaris implement. It is supposed to be equivalent to fdatasync(), not fsync(), and it is only supposed to sync the portion of the file that overlaps the range passed to msync. If the VMA is non-linear, fall back to syncing the entire file, but we still optimise to only fdatasync() the entire file, not the full fsync(). akpm: there are obvious concerns with bck-compatibility: is anyone relying on the undocumented side-effect for their data integrity? And how would they ever know if this change broke their data integrity? We think the risk is reasonably low, and this patch brings the kernel into line with other OS's and with what the manpage has always said... Signed-off-by: Matthew Wilcox <[email protected]> Reviewed-by: Christoph Hellwig <[email protected]> Acked-by: Jeff Moyer <[email protected]> Cc: Chris Mason <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>