Age | Commit message | Author | Files | Lines
2018-08-22 | fs/epoll: robustify irq safety with lockdep_assert_irqs_enabled() | Davidlohr Bueso | 1 | -0/+8
Sprinkle lockdep_assert_irqs_enabled() checks in the functions that do not save and restore interrupts when dealing with the ep->wq.lock. These are ep_scan_ready_list() and those called by epoll_ctl(): ep_insert, ep_modify and ep_remove. [[email protected]: remove too-obvious comments] Link: http://lkml.kernel.org/r/20180721183127.3busfa335zlcjeox@linux-r8p5 Signed-off-by: Davidlohr Bueso <[email protected]> Cc: Peter Zijlstra <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2018-08-22 | fs/epoll: loosen irq safety in epoll_insert() and epoll_remove() | Davidlohr Bueso | 1 | -8/+6
Both functions are similar to the context of ep_modify(), called via epoll_ctl(2). Just like ep_modify(), saving and restoring interrupts is an overkill in these calls as it will never be called with irqs disabled. While ep_remove() can be called directly from EPOLL_CTL_DEL, it can also be called when releasing the file, but this also complies with the above. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Davidlohr Bueso <[email protected]> Cc: Jason Baron <[email protected]> Cc: Al Viro <[email protected]> Cc: Peter Zijlstra <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2018-08-22 | fs/epoll: loosen irq safety in ep_scan_ready_list() | Davidlohr Bueso | 1 | -5/+4
Patch series "fs/epoll: loosen irq safety when possible".

Both patches replace saving+restoring interrupts when taking the ep->lock (now the waitqueue lock), with just disabling local irqs. This shows immediate performance benefits in patch 1 for an epoll workload running on Xen.

The main concern we need to have with this sort of change in epoll is ep_poll_callback(), which is passed to the wait queue wakeup and very often runs in irq context; this patch does not touch this call.

Patches have been tested pretty heavily with the customer workload, microbenchmarks, ltp testcases and two high level workloads that use epoll under the hood: nginx and libevent benchmarks.

This patch (of 2):

Saving and restoring interrupts in ep_scan_ready_list() is overkill as it is never called with interrupts disabled. Loosen this to simply disabling local irqs, which helps archs where managing irqs is expensive as well as virtual environments. This patch yields some throughput improvements on a workload that is epoll intensive running on a single Xen DomU.

  1 Job    7500 -->  8800 enq/s (+17%)
  2 Jobs  14000 --> 15200 enq/s (+8%)
  3 Jobs  20500 --> 22300 enq/s (+8%)
  4 Jobs  25000 --> 28000 enq/s (+8-12%)

On bare metal: for a 2-socket 40-core (ht) IvyBridge on a few workloads. Unfortunately I don't have a Xen environment, and for the Xen results I do have (whose numbers are in patch 1) I don't have the actual workload, so I cannot compare them directly.

1) Different configurations were used for an epoll_wait (pipes io) microbench (http://linux-scalability.org/epoll/epoll-test.c), which shows around a 7-10% improvement in the overall total number of times the epoll_wait() loops when using both regular and nested epolls, so very raw numbers, but measurable nonetheless.

  # threads    vanilla      dirty
          1    1677717    1805587
          2    1660510    1854064
          4    1610184    1805484
          8    1577696    1751222
         16    1568837    1725299
         32    1291532    1378463
         64     752584     787368

Note that stddev is pretty small.

2) Another pipe test, which shows no real measurable improvement. (http://www.xmailserver.org/linux-patches/pipetest.c)

Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Davidlohr Bueso <[email protected]> Cc: Jason Baron <[email protected]> Cc: Al Viro <[email protected]> Cc: Peter Zijlstra <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2018-08-22 | sched/wait: assert the wait_queue_head lock is held in __wake_up_common | Christoph Hellwig | 1 | -0/+2
Better ensure we actually hold the lock using lockdep than just commenting on it. Due to the various exported _locked interfaces it is far too easy to get the locking wrong. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Christoph Hellwig <[email protected]> Acked-by: Ingo Molnar <[email protected]> Cc: Al Viro <[email protected]> Cc: Andrea Arcangeli <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Jason Baron <[email protected]> Cc: Matthew Wilcox <[email protected]> Cc: Mike Rapoport <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Davidlohr Bueso <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2018-08-22 | userfaultfd: use fault_wqh lock | Matthew Wilcox | 1 | -3/+3
The userfaultfd code currently uses the unlocked waitqueue helpers for managing fault_wqh, but instead of holding the waitqueue lock for this waitqueue around these calls, it uses the waitqueue lock of fault_pending_wqh, which is a different waitqueue instance. Given that the waitqueue is not exposed to the rest of the kernel this actually works ok at the moment, but prevents the userfaultfd locking rules from being enforced using lockdep. Switch to the internally locked waitqueue helpers instead. This means that the lock inside fault_wqh now nests inside the fault_pending_wqh lock, but that's not a problem since it was entirely unused before. [[email protected]: slight changelog updates] [[email protected]: spotted changelog spellos] Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Matthew Wilcox <[email protected]> Signed-off-by: Christoph Hellwig <[email protected]> Reviewed-by: Mike Rapoport <[email protected]> Cc: Al Viro <[email protected]> Cc: Andrea Arcangeli <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Jason Baron <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Davidlohr Bueso <[email protected]> Cc: Matthew Wilcox <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2018-08-22 | epoll: use the waitqueue lock to protect ep->wq | Christoph Hellwig | 1 | -36/+29
Patch series "waitqueue lockdep annotation", v3. This series adds a strategic lockdep_assert_held to __wake_up_common to ensure callers really do hold the wait_queue_head lock when calling the unlocked wake_up variants. It turns out epoll did not do this for a fairly common path (hit all the time by systemd during bootup), so the second patch fixed this instance as well. This patch (of 3): The epoll code currently uses the unlocked waitqueue helpers for managing ep->wq, but instead of holding the waitqueue lock around these calls, it uses its own ep->lock spinlock. Given that the waitqueue is not exposed to the rest of the kernel this actually works ok at the moment, but prevents the epoll locking rules from being enforced using lockdep. Remove ep->lock and use the waitqueue lock to not only reduce the size of struct eventpoll but also to make sure we can assert locking invariants in the waitqueue code. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Christoph Hellwig <[email protected]> Reviewed-by: Jason Baron <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Al Viro <[email protected]> Cc: Andrea Arcangeli <[email protected]> Cc: Mike Rapoport <[email protected]> Cc: Jason Baron <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Matthew Wilcox <[email protected]> Cc: Davidlohr Bueso <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2018-08-22 | kernel: tracepoints: add support for relative references | Ard Biesheuvel | 2 | -27/+41
To avoid the need for relocating absolute references to tracepoint structures at boot time when running relocatable kernels (which may take a disproportionate amount of space), add the option to emit these tables as relative references instead. Link: http://lkml.kernel.org/r/[email protected] Acked-by: Michael Ellerman <[email protected]> Acked-by: Ingo Molnar <[email protected]> Acked-by: Steven Rostedt (VMware) <[email protected]> Signed-off-by: Ard Biesheuvel <[email protected]> Cc: Arnd Bergmann <[email protected]> Cc: Benjamin Herrenschmidt <[email protected]> Cc: Bjorn Helgaas <[email protected]> Cc: Catalin Marinas <[email protected]> Cc: James Morris <[email protected]> Cc: James Morris <[email protected]> Cc: Jessica Yu <[email protected]> Cc: Josh Poimboeuf <[email protected]> Cc: Kees Cook <[email protected]> Cc: Nicolas Pitre <[email protected]> Cc: Paul Mackerras <[email protected]> Cc: Petr Mladek <[email protected]> Cc: Russell King <[email protected]> Cc: "Serge E. Hallyn" <[email protected]> Cc: Sergey Senozhatsky <[email protected]> Cc: Thomas Garnier <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Will Deacon <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2018-08-22 | PCI: Add support for relative addressing in quirk tables | Ard Biesheuvel | 2 | -3/+29
Allow the PCI quirk tables to be emitted in a way that avoids absolute references to the hook functions. This reduces the size of the entries, and, more importantly, makes them invariant under runtime relocation (e.g., for KASLR) Link: http://lkml.kernel.org/r/[email protected] Acked-by: Bjorn Helgaas <[email protected]> Acked-by: Michael Ellerman <[email protected]> Acked-by: Ingo Molnar <[email protected]> Signed-off-by: Ard Biesheuvel <[email protected]> Cc: Arnd Bergmann <[email protected]> Cc: Benjamin Herrenschmidt <[email protected]> Cc: Catalin Marinas <[email protected]> Cc: James Morris <[email protected]> Cc: James Morris <[email protected]> Cc: Jessica Yu <[email protected]> Cc: Josh Poimboeuf <[email protected]> Cc: Kees Cook <[email protected]> Cc: Nicolas Pitre <[email protected]> Cc: Paul Mackerras <[email protected]> Cc: Petr Mladek <[email protected]> Cc: Russell King <[email protected]> Cc: "Serge E. Hallyn" <[email protected]> Cc: Sergey Senozhatsky <[email protected]> Cc: Steven Rostedt <[email protected]> Cc: Thomas Garnier <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Will Deacon <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2018-08-22 | init: allow initcall tables to be emitted using relative references | Ard Biesheuvel | 4 | -41/+68
Allow the initcall tables to be emitted using relative references that are only half the size on 64-bit architectures and don't require fixups at runtime on relocatable kernels. Link: http://lkml.kernel.org/r/[email protected] Acked-by: James Morris <[email protected]> Acked-by: Sergey Senozhatsky <[email protected]> Acked-by: Petr Mladek <[email protected]> Acked-by: Michael Ellerman <[email protected]> Acked-by: Ingo Molnar <[email protected]> Signed-off-by: Ard Biesheuvel <[email protected]> Cc: Arnd Bergmann <[email protected]> Cc: Benjamin Herrenschmidt <[email protected]> Cc: Bjorn Helgaas <[email protected]> Cc: Catalin Marinas <[email protected]> Cc: James Morris <[email protected]> Cc: Jessica Yu <[email protected]> Cc: Josh Poimboeuf <[email protected]> Cc: Kees Cook <[email protected]> Cc: Nicolas Pitre <[email protected]> Cc: Paul Mackerras <[email protected]> Cc: Russell King <[email protected]> Cc: "Serge E. Hallyn" <[email protected]> Cc: Steven Rostedt <[email protected]> Cc: Thomas Garnier <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Will Deacon <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2018-08-22 | module: use relative references for __ksymtab entries | Ard Biesheuvel | 6 | -24/+91
An ordinary arm64 defconfig build has ~64 KB worth of __ksymtab entries, each consisting of two 64-bit fields containing absolute references, to the symbol itself and to a char array containing its name, respectively. When we build the same configuration with KASLR enabled, we end up with an additional ~192 KB of relocations in the .init section, i.e., one 24 byte entry for each absolute reference, which all need to be processed at boot time. Given how the struct kernel_symbol that describes each entry is completely local to module.c (except for the references emitted by EXPORT_SYMBOL() itself), we can easily modify it to contain two 32-bit relative references instead. This reduces the size of the __ksymtab section by 50% for all 64-bit architectures, and gets rid of the runtime relocations entirely for architectures implementing KASLR, either via standard PIE linking (arm64) or using custom host tools (x86). Note that the binary search involving __ksymtab contents relies on each section being sorted by symbol name. This is implemented based on the input section names, not the names in the ksymtab entries, so this patch does not interfere with that. Given that the use of place-relative relocations requires support both in the toolchain and in the module loader, we cannot enable this feature for all architectures. So make it dependent on whether CONFIG_HAVE_ARCH_PREL32_RELOCATIONS is defined. 
Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ard Biesheuvel <[email protected]> Acked-by: Jessica Yu <[email protected]> Acked-by: Michael Ellerman <[email protected]> Reviewed-by: Will Deacon <[email protected]> Acked-by: Ingo Molnar <[email protected]> Cc: Arnd Bergmann <[email protected]> Cc: Benjamin Herrenschmidt <[email protected]> Cc: Bjorn Helgaas <[email protected]> Cc: Catalin Marinas <[email protected]> Cc: James Morris <[email protected]> Cc: James Morris <[email protected]> Cc: Josh Poimboeuf <[email protected]> Cc: Kees Cook <[email protected]> Cc: Nicolas Pitre <[email protected]> Cc: Paul Mackerras <[email protected]> Cc: Petr Mladek <[email protected]> Cc: Russell King <[email protected]> Cc: "Serge E. Hallyn" <[email protected]> Cc: Sergey Senozhatsky <[email protected]> Cc: Steven Rostedt <[email protected]> Cc: Thomas Garnier <[email protected]> Cc: Thomas Gleixner <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
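The place-relative encoding described in this commit can be illustrated with a small userspace sketch. It is not the kernel's struct kernel_symbol; the `demo_` names and the runtime `demo_set_offset()` helper are assumptions made for illustration (in the kernel the offsets are emitted at build time by the toolchain). Each field stores the signed 32-bit distance from the field's own address to its target, so the entry is half the size of two 64-bit pointers and stays valid if the whole image is moved.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical entry layout: two 32-bit place-relative offsets (8 bytes). */
struct demo_rel_symbol {
	int32_t value_offset;
	int32_t name_offset;
};

/* Absolute layout for comparison: two pointers (16 bytes on 64-bit). */
struct demo_abs_symbol {
	const void *value;
	const char *name;
};

/* Store the distance from the field itself to the target. */
static void demo_set_offset(int32_t *field, const void *target)
{
	intptr_t delta = (intptr_t)target - (intptr_t)field;

	/* The scheme only works if the target is within +/-2 GiB of the field. */
	assert(delta >= INT32_MIN && delta <= INT32_MAX);
	*field = (int32_t)delta;
}

/* Recover the absolute address: address of the field plus its contents. */
static const void *demo_offset_to_ptr(const int32_t *field)
{
	return (const void *)((intptr_t)field + *field);
}

/* A made-up "exported" object and one table entry describing it. */
static int demo_exported_value = 42;
static const char demo_exported_name[] = "demo_exported_value";
static struct demo_rel_symbol demo_entry;
```

Because only the distance between the entry and its target is stored, relocating both by the same amount leaves the stored value unchanged, which is why no boot-time fixup is needed for such entries.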
2018-08-22 | module: allow symbol exports to be disabled | Ard Biesheuvel | 3 | -5/+12
To allow existing C code to be incorporated into the decompressor or the UEFI stub, introduce a CPP macro that turns all EXPORT_SYMBOL_xxx declarations into nops, and #define it in places where such exports are undesirable. Note that this gets rid of a rather dodgy redefine of linux/export.h's header guard. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ard Biesheuvel <[email protected]> Acked-by: Nicolas Pitre <[email protected]> Acked-by: Michael Ellerman <[email protected]> Reviewed-by: Will Deacon <[email protected]> Acked-by: Ingo Molnar <[email protected]> Cc: Arnd Bergmann <[email protected]> Cc: Benjamin Herrenschmidt <[email protected]> Cc: Bjorn Helgaas <[email protected]> Cc: Catalin Marinas <[email protected]> Cc: James Morris <[email protected]> Cc: James Morris <[email protected]> Cc: Jessica Yu <[email protected]> Cc: Josh Poimboeuf <[email protected]> Cc: Kees Cook <[email protected]> Cc: Paul Mackerras <[email protected]> Cc: Petr Mladek <[email protected]> Cc: Russell King <[email protected]> Cc: "Serge E. Hallyn" <[email protected]> Cc: Sergey Senozhatsky <[email protected]> Cc: Steven Rostedt <[email protected]> Cc: Thomas Garnier <[email protected]> Cc: Thomas Gleixner <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2018-08-22 | arch: enable relative relocations for arm64, power and x86 | Ard Biesheuvel | 4 | -0/+13
Patch series "add support for relative references in special sections", v10. This adds support for emitting special sections such as initcall arrays, PCI fixups and tracepoints as relative references rather than absolute references. This reduces the size by 50% on 64-bit architectures, but more importantly, it removes the need for carrying relocation metadata for these sections in relocatable kernels (e.g., for KASLR) that needs to be fixed up at boot time. On arm64, this reduces the vmlinux footprint of such a reference by 8x (8 byte absolute reference + 24 byte RELA entry vs 4 byte relative reference) Patch #3 was sent out before as a single patch. This series supersedes the previous submission. This version makes relative ksymtab entries dependent on the new Kconfig symbol HAVE_ARCH_PREL32_RELOCATIONS rather than trying to infer from kbuild test robot replies for which architectures it should be blacklisted. Patch #1 introduces the new Kconfig symbol HAVE_ARCH_PREL32_RELOCATIONS, and sets it for the main architectures that are expected to benefit the most from this feature, i.e., 64-bit architectures or ones that use runtime relocations. Patch #2 add support for #define'ing __DISABLE_EXPORTS to get rid of ksymtab/kcrctab sections in decompressor and EFI stub objects when rebuilding existing C files to run in a different context. Patches #4 - #6 implement relative references for initcalls, PCI fixups and tracepoints, respectively, all of which produce sections with order ~1000 entries on an arm64 defconfig kernel with tracing enabled. This means we save about 28 KB of vmlinux space for each of these patches. [From the v7 series blurb, which included the jump_label patches as well]: For the arm64 kernel, all patches combined reduce the memory footprint of vmlinux by about 1.3 MB (using a config copied from Ubuntu that has KASLR enabled), of which ~1 MB is the size reduction of the RELA section in .init, and the remaining 300 KB is reduction of .text/.data. 
This patch (of 6): Before updating certain subsystems to use place relative 32-bit relocations in special sections, to save space and reduce the number of absolute relocations that need to be processed at runtime by relocatable kernels, introduce the Kconfig symbol and define it for some architectures that should be able to support and benefit from it. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ard Biesheuvel <[email protected]> Acked-by: Michael Ellerman <[email protected]> Reviewed-by: Will Deacon <[email protected]> Acked-by: Ingo Molnar <[email protected]> Cc: Arnd Bergmann <[email protected]> Cc: Kees Cook <[email protected]> Cc: Thomas Garnier <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: "Serge E. Hallyn" <[email protected]> Cc: Bjorn Helgaas <[email protected]> Cc: Benjamin Herrenschmidt <[email protected]> Cc: Russell King <[email protected]> Cc: Paul Mackerras <[email protected]> Cc: Catalin Marinas <[email protected]> Cc: Petr Mladek <[email protected]> Cc: James Morris <[email protected]> Cc: Nicolas Pitre <[email protected]> Cc: Josh Poimboeuf <[email protected]> Cc: Steven Rostedt <[email protected]> Cc: Sergey Senozhatsky <[email protected]>, Cc: James Morris <[email protected]> Cc: Jessica Yu <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2018-08-22 | spelling.txt: add more spellings to spelling.txt | Colin Ian King | 1 | -0/+88
Here are some of the more common spelling mistakes and typos that I've found while fixing up spelling mistakes in the kernel over the past 6 months. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Colin Ian King <[email protected]> Cc: Joe Perches <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2018-08-22 | kernel/hung_task.c: allow to set checking interval separately from timeout | Dmitry Vyukov | 6 | -3/+43
Currently the task hung checking interval is equal to the timeout; as a result, a hang is detected anywhere between timeout and 2*timeout. This is fine for most interactive environments, but it hurts automated testing setups (syzbot). In an automated setup we need to strictly order CPU lockup < RCU stall < workqueue lockup < task hung < silent loss, so that an RCU stall is not detected as task hung and task hung is not detected as silent machine loss. The large variance in the task hung detection timeout requires setting the silent machine loss timeout to a very large value (e.g. if task hung is 3 mins, then silent loss needs to be set to ~7 mins). The additional 3 minutes significantly reduce testing efficiency because usually we crash the kernel within a minute, and this can add hours to the bug localization process as it needs to do dozens of tests. Allow setting the checking interval separately from the timeout. This allows setting the timeout to, say, 3 minutes, but the checking interval to 10 secs. The interval is controlled via a new hung_task_check_interval_secs sysctl, similar to the existing hung_task_timeout_secs sysctl. The default value of 0 results in the current behavior: checking interval is equal to timeout. [[email protected]: update hung_task_timeout_max's comment] Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Dmitry Vyukov <[email protected]> Cc: Paul E. McKenney <[email protected]> Cc: Tetsuo Handa <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Ingo Molnar <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
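The detection window in this changelog can be worked through with a simplified model. This is a sketch, not the code in kernel/hung_task.c; the `demo_` functions are hypothetical. It assumes a hang is reported at the first periodic check that happens at least `timeout` seconds after the task stopped making progress, so the worst case is the timeout plus one full checking interval.

```c
#include <assert.h>

/* An interval of 0 means "check once per timeout", matching the described default. */
static unsigned long demo_effective_interval(unsigned long timeout_secs,
					     unsigned long interval_secs)
{
	return interval_secs ? interval_secs : timeout_secs;
}

/* Best case: a check happens exactly when the timeout expires. */
static unsigned long demo_best_case_detection(unsigned long timeout_secs)
{
	return timeout_secs;
}

/*
 * Worst case: the timeout expires just after a check, so the hang is only
 * noticed one full interval later.
 */
static unsigned long demo_worst_case_detection(unsigned long timeout_secs,
					       unsigned long interval_secs)
{
	assert(timeout_secs > 0);
	return timeout_secs + demo_effective_interval(timeout_secs, interval_secs);
}
```

With a 180 second timeout, the default interval gives a worst case of 360 seconds (2*timeout), while a 10 second interval narrows it to 190 seconds, which is what lets the silent-loss timeout sit much closer to the task hung timeout.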
2018-08-22 | kernel/crash_core.c: print timestamp using time64_t | Arnd Bergmann | 1 | -1/+1
The get_seconds() call returns a 32-bit timestamp on some architectures, and will overflow in the future. The newer ktime_get_real_seconds() always returns a 64-bit timestamp that does not suffer from this problem. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Arnd Bergmann <[email protected]> Reviewed-by: Andrew Morton <[email protected]> Cc: Dave Young <[email protected]> Cc: Baoquan He <[email protected]> Cc: "Kirill A. Shutemov" <[email protected]> Cc: Petr Tesarik <[email protected]> Cc: Marc-André Lureau <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
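The overflow mentioned here is plain integer truncation, which a tiny userspace sketch can show. The `demo_seconds_32()` helper is hypothetical and only models what happens when a seconds count is squeezed into 32 bits; it does not call or reproduce get_seconds() or ktime_get_real_seconds().

```c
#include <assert.h>
#include <stdint.h>

/*
 * Model of a 32-bit seconds counter: only the low 32 bits of the real
 * 64-bit value survive, so the counter wraps back to 0 after 2^32 seconds.
 */
static uint32_t demo_seconds_32(int64_t real_seconds)
{
	assert(real_seconds >= 0);
	return (uint32_t)real_seconds;
}
```

A timestamp from August 2018 (about 1.5 billion seconds) still fits, but a value five seconds past 2^32 reads back as 5, which is the failure a 64-bit type avoids.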
2018-08-22 | linux/compiler.h: don't use bool | Rasmus Villemoes | 1 | -1/+1
Apparently, it's possible to have a non-trivial TU include a few headers, including linux/build_bug.h, without ending up with linux/types.h. So the 0day bot sent me

  config: um-x86_64_defconfig (attached as .config)
  >> include/linux/compiler.h:316:3: error: unknown type name 'bool'; did you mean '_Bool'?
       bool __cond = !(condition); \

for something I'm working on. Rather than contributing to the #include madness and including linux/types.h from compiler.h, just use int. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Rasmus Villemoes <[email protected]> Cc: Christopher Li <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2018-08-22 | userns: use irqsave variant of refcount_dec_and_lock() | Anna-Maria Gleixner | 1 | -4/+1
The irqsave variant of refcount_dec_and_lock handles irqsave/restore when taking/releasing the spin lock. With this variant the call of local_irq_save/restore is no longer required. [[email protected]: s@atomic_dec_and_lock@refcount_dec_and_lock@g] Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Anna-Maria Gleixner <[email protected]> Signed-off-by: Sebastian Andrzej Siewior <[email protected]> Acked-by: Peter Zijlstra (Intel) <[email protected]> Cc: "Eric W. Biederman" <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Thomas Gleixner <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2018-08-22 | userns: use refcount_t for reference counting instead atomic_t | Sebastian Andrzej Siewior | 2 | -6/+7
refcount_t type and corresponding API should be used instead of atomic_t when the variable is used as a reference counter. This avoids accidental refcounter overflows that might lead to use-after-free situations. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Sebastian Andrzej Siewior <[email protected]> Suggested-by: Peter Zijlstra <[email protected]> Acked-by: Peter Zijlstra (Intel) <[email protected]> Reviewed-by: Andrew Morton <[email protected]> Cc: "Eric W. Biederman" <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Thomas Gleixner <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2018-08-22 | bdi: use irqsave variant of refcount_dec_and_lock() | Anna-Maria Gleixner | 1 | -4/+1
The irqsave variant of refcount_dec_and_lock handles irqsave/restore when taking/releasing the spin lock. With this variant the call of local_irq_save/restore is no longer required. [[email protected]: s@atomic_dec_and_lock@refcount_dec_and_lock@g] Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Anna-Maria Gleixner <[email protected]> Signed-off-by: Sebastian Andrzej Siewior <[email protected]> Acked-by: Peter Zijlstra (Intel) <[email protected]> Cc: Jens Axboe <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Thomas Gleixner <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2018-08-22 | bdi: use refcount_t for reference counting instead atomic_t | Sebastian Andrzej Siewior | 3 | -9/+10
refcount_t type and corresponding API should be used instead of atomic_t when the variable is used as a reference counter. This permits avoiding accidental refcounter overflows that might lead to use-after-free situations. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Sebastian Andrzej Siewior <[email protected]> Reviewed-by: Andrew Morton <[email protected]> Acked-by: Peter Zijlstra (Intel) <[email protected]> Suggested-by: Peter Zijlstra <[email protected]> Cc: Jens Axboe <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Thomas Gleixner <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
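Why an overflowing reference counter can lead to a use-after-free, and how a saturating counter avoids it, can be sketched in userspace. This is a deliberately simplified, non-atomic model and not the kernel's refcount_t implementation; every `demo_` name is hypothetical. The assumption illustrated is only the general idea: a plain counter wraps to zero and looks "free", while a saturating counter sticks at its maximum and leaks the object instead.

```c
#include <assert.h>
#include <limits.h>

struct demo_refcount {
	unsigned int refs;
};

/* Plain counter: incrementing past UINT_MAX wraps around to 0. */
static void demo_plain_inc(unsigned int *counter)
{
	(*counter)++;
}

/* Saturating counter: once it reaches UINT_MAX it stays there. */
static void demo_refcount_inc(struct demo_refcount *r)
{
	if (r->refs != UINT_MAX)
		r->refs++;
}

/*
 * Returns 1 when the last reference was dropped and the object may be
 * freed. A saturated counter is never decremented, so the object is
 * leaked rather than freed while still in use.
 */
static int demo_refcount_dec_and_test(struct demo_refcount *r)
{
	assert(r->refs != 0);	/* dropping a reference nobody holds is a bug */
	if (r->refs == UINT_MAX)
		return 0;
	return --r->refs == 0;
}
```

The trade-off in this model is a bounded memory leak in exchange for never treating a still-referenced object as free.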
2018-08-22 | kernel.h: documentation for roundup() vs round_up() | Kees Cook | 1 | -1/+34
Things like 3619dec5103d ("dh key: fix rounding up KDF output length") expose the lack of explicit documentation for roundup() vs round_up(). At least we can try to document it better if anyone goes looking. Link: http://lkml.kernel.org/r/20180703041950.GA43464@beast Signed-off-by: Kees Cook <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Greg Kroah-Hartman <[email protected]> Cc: Randy Dunlap <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
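The difference between the two helpers is easy to see in a userspace sketch. The functions below are re-implementations for illustration with hypothetical `demo_` names, not the kernel macros themselves: the division-based form works for any non-zero multiple, while the mask-based form is only correct when the multiple is a power of two.

```c
#include <assert.h>

/* Division-based rounding: valid for any non-zero y. */
static unsigned long demo_roundup(unsigned long x, unsigned long y)
{
	assert(y != 0);
	return ((x + y - 1) / y) * y;
}

/* Mask-based rounding with no checking, to show what goes wrong. */
static unsigned long demo_round_up_mask(unsigned long x, unsigned long y)
{
	return ((x - 1) | (y - 1)) + 1;
}

/* Mask-based rounding: only valid when y is a power of two. */
static unsigned long demo_round_up_pow2(unsigned long x, unsigned long y)
{
	assert(y != 0 && (y & (y - 1)) == 0);
	return demo_round_up_mask(x, y);
}
```

For a power-of-two multiple both agree (5 rounded to a multiple of 4 is 8 either way). For a multiple of 6, the division form rounds 10 up to 12, while the unchecked mask form produces 14, which is not a multiple of 6 at all.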
2018-08-22 | include/asm-generic/bug.h: clarify valid uses of WARN() | Dmitry Vyukov | 1 | -3/+13
Explicitly state that WARN*() should be used only for recoverable kernel issues/bugs and that it should not be used for any kind of invalid external inputs or transient conditions. Motivation: it's a very useful capability to be able to understand if a particular kernel splat means a kernel bug or simply an invalid user-space program. For the former one wants to notify kernel developers, while notifying kernel developers for the latter is annoying. Even a kernel developer may not know what to do with a WARNING in an unfamiliar subsystem. This is especially critical for any automated testing systems that may use panic_on_warn and mail kernel developers. The clear separation also serves as an additional documentation: is it a condition that must never occur because of additional checks/logic elsewhere? or is it simply a check for invalid inputs or unfortunate conditions? Use of pr_err() for user messages also leads to better error messages. "Something is wrong in file foo on line X" is not particularly useful message for end user. pr_err() forces developers to write more meaningful error messages for user. As of now we are almost there. We are doing systematic kernel testing with panic_on_warn and are not seeing massive amounts of false positives. But every now and then another WARN on ENOMEM or invalid inputs pops up and leads to a lengthy argument each time. The goal of this change is to officially document the rules. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Dmitry Vyukov <[email protected]> Acked-by: Greg Kroah-Hartman <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2018-08-22 | proc/kcore: add vmcoreinfo note to /proc/kcore | Omar Sandoval | 4 | -4/+21
The vmcoreinfo information is useful for runtime debugging tools, not just for crash dumps. A lot of this information can be determined by other means, but this is much more convenient, and it only adds a page at most to the file. Link: http://lkml.kernel.org/r/fddbcd08eed76344863303878b12de1c1e2a04b6.1531953780.git.osandov@fb.com Signed-off-by: Omar Sandoval <[email protected]> Cc: Alexey Dobriyan <[email protected]> Cc: Bhupesh Sharma <[email protected]> Cc: Eric Biederman <[email protected]> Cc: James Morse <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2018-08-22 | crash_core: use VMCOREINFO_SYMBOL_ARRAY() for swapper_pg_dir | Omar Sandoval | 1 | -1/+1
This is preparation for allowing CRASH_CORE to be enabled for any architecture. swapper_pg_dir is always either an array or a macro expanding to NULL. In the latter case, VMCOREINFO_SYMBOL() won't work, as it tries to take the address of the given symbol:

  #define VMCOREINFO_SYMBOL(name) \
          vmcoreinfo_append_str("SYMBOL(%s)=%lx\n", #name, (unsigned long)&name)

Instead, use VMCOREINFO_SYMBOL_ARRAY(), which uses the value:

  #define VMCOREINFO_SYMBOL_ARRAY(name) \
          vmcoreinfo_append_str("SYMBOL(%s)=%lx\n", #name, (unsigned long)name)

This is the same thing for the array case but isn't an error for the macro case. Link: http://lkml.kernel.org/r/c05f9781ec204f40fc96f95086e7b6de6a3eb2c3.1532563124.git.osandov@fb.com Signed-off-by: Omar Sandoval <[email protected]> Cc: Alexey Dobriyan <[email protected]> Cc: Bhupesh Sharma <[email protected]> Cc: Eric Biederman <[email protected]> Cc: James Morse <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
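The array-versus-macro distinction can be reproduced in userspace. The sketch below uses hypothetical `DEMO_` macros that write into a caller-supplied buffer with snprintf() instead of calling vmcoreinfo_append_str(); `demo_pg_dir` and `demo_null_pg_dir` are stand-ins invented for the example. It relies on two C facts: an array and its address convert to the same integer value, and the address of a NULL constant cannot be taken.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Takes the address of the symbol: needs an lvalue. buf must be an array. */
#define DEMO_SYMBOL(buf, name) \
	snprintf(buf, sizeof(buf), "SYMBOL(%s)=%lx\n", #name, (unsigned long)&name)

/* Uses the value of the symbol: works for arrays and for NULL macros. */
#define DEMO_SYMBOL_ARRAY(buf, name) \
	snprintf(buf, sizeof(buf), "SYMBOL(%s)=%lx\n", #name, (unsigned long)name)

/* Stand-in for an architecture where the symbol is a real array. */
static unsigned long demo_pg_dir[4];

/* Stand-in for an architecture where the symbol is a macro expanding to NULL. */
#define demo_null_pg_dir NULL

/* DEMO_SYMBOL(buf, demo_null_pg_dir) would not compile: &NULL is not valid C. */
```

For `demo_pg_dir` both macros format the same address, and for `demo_null_pg_dir` the array variant formats 0 rather than failing to build.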
2018-08-22 | proc/kcore: optimize multiple page reads | Omar Sandoval | 1 | -3/+11
The current code does a full search of the segment list every time for every page. This is wasteful, since it's almost certain that the next page will be in the same segment. Instead, check if the previous segment covers the current page before doing the list search. Link: http://lkml.kernel.org/r/fd346c11090cf93d867e01b8d73a6567c5ac6361.1531953780.git.osandov@fb.com Signed-off-by: Omar Sandoval <[email protected]> Cc: Alexey Dobriyan <[email protected]> Cc: Bhupesh Sharma <[email protected]> Cc: Eric Biederman <[email protected]> Cc: James Morse <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
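The optimization described here is a last-hit cache in front of a linear search. The following is a minimal sketch under assumed data structures, not the fs/proc/kcore.c code: `struct demo_segment`, `demo_find()` and the `demo_list_walks` counter are all hypothetical names introduced for the example.

```c
#include <assert.h>
#include <stddef.h>

struct demo_segment {
	unsigned long start;
	unsigned long size;
	struct demo_segment *next;
};

/* Counts full list searches so the effect of the cache is observable. */
static unsigned long demo_list_walks;

static int demo_covers(const struct demo_segment *seg, unsigned long addr)
{
	return seg && addr >= seg->start && addr - seg->start < seg->size;
}

/*
 * Look up the segment containing addr. *last remembers the previous hit;
 * consecutive pages usually fall in the same segment, so check it first
 * and only fall back to walking the list on a miss.
 */
static struct demo_segment *demo_find(struct demo_segment *head,
				      struct demo_segment **last,
				      unsigned long addr)
{
	struct demo_segment *seg;

	assert(last != NULL);
	if (demo_covers(*last, addr))
		return *last;

	demo_list_walks++;
	for (seg = head; seg; seg = seg->next)
		if (demo_covers(seg, addr))
			break;

	*last = seg;	/* NULL when no segment covers addr */
	return seg;
}
```

Reading a run of pages from one segment then costs a single list walk for the first page and a constant-time check for each page after it.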
2018-08-22 | proc/kcore: clean up ELF header generation | Omar Sandoval | 1 | -209/+141
Currently, the ELF file header, program headers, and note segment are allocated all at once, in some icky code dating back to 2.3. Programs tend to read the file header, then the program headers, then the note segment, all separately, so this is a waste of effort. It's cleaner and more efficient to handle the three separately. Link: http://lkml.kernel.org/r/19c92cbad0e11f6103ff3274b2e7a7e51a1eb74b.1531953780.git.osandov@fb.com Signed-off-by: Omar Sandoval <[email protected]> Cc: Alexey Dobriyan <[email protected]> Cc: Bhupesh Sharma <[email protected]> Cc: Eric Biederman <[email protected]> Cc: James Morse <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2018-08-22 | proc/kcore: hold lock during read | Omar Sandoval | 1 | -30/+40
Now that we're using an rwsem, we can hold it during the entirety of read_kcore() and have a common return path. This is preparation for the next change. [[email protected]: fix locking bug reported by Tetsuo Handa] Link: http://lkml.kernel.org/r/d7cfbc1e8a76616f3b699eaff9df0a2730380534.1531953780.git.osandov@fb.com Signed-off-by: Omar Sandoval <[email protected]> Cc: Alexey Dobriyan <[email protected]> Cc: Bhupesh Sharma <[email protected]> Cc: Eric Biederman <[email protected]> Cc: James Morse <[email protected]> Cc: Tetsuo Handa <[email protected]> Cc: Stephen Rothwell <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2018-08-22proc/kcore: fix memory hotplug vs multiple opens raceOmar Sandoval1-49/+44
There's a theoretical race condition that will cause /proc/kcore to miss a memory hotplug event:

    CPU0                              CPU1
    // hotplug event 1
    kcore_need_update = 1

    open_kcore()                      open_kcore()
        kcore_update_ram()                kcore_update_ram()
            // Walk RAM                       // Walk RAM
            __kcore_update_ram()              __kcore_update_ram()
                kcore_need_update = 0

    // hotplug event 2
    kcore_need_update = 1
                                              kcore_need_update = 0

Note that CPU1 set up the RAM kcore entries with the state after hotplug event 1 but cleared the flag for hotplug event 2. The RAM entries will therefore be stale until there is another hotplug event. This is an extremely unlikely sequence of events, but the fix makes the synchronization saner, anyways: we serialize the entire update sequence, which means that whoever clears the flag will always succeed in replacing the kcore list. Link: http://lkml.kernel.org/r/6106c509998779730c12400c1b996425df7d7089.1531953780.git.osandov@fb.com Signed-off-by: Omar Sandoval <[email protected]> Cc: Alexey Dobriyan <[email protected]> Cc: Bhupesh Sharma <[email protected]> Cc: Eric Biederman <[email protected]> Cc: James Morse <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2018-08-22proc/kcore: replace kclist_lock rwlock with rwsemOmar Sandoval1-10/+10
Now we only need kclist_lock from user context and at fs init time, and the following changes need to sleep while holding the kclist_lock. Link: http://lkml.kernel.org/r/521ba449ebe921d905177410fee9222d07882f0d.1531953780.git.osandov@fb.com Signed-off-by: Omar Sandoval <[email protected]> Reviewed-by: Andrew Morton <[email protected]> Cc: Alexey Dobriyan <[email protected]> Cc: Bhupesh Sharma <[email protected]> Cc: Eric Biederman <[email protected]> Cc: James Morse <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2018-08-22proc/kcore: don't grab lock for memory hotplug notifierOmar Sandoval1-4/+2
The memory hotplug notifier kcore_callback() only needs kclist_lock to prevent races with __kcore_update_ram(), but we can easily eliminate that race by using an atomic xchg() in __kcore_update_ram(). This is preparation for converting kclist_lock to an rwsem. Link: http://lkml.kernel.org/r/0a4bc89f4dbde8b5b2ea309f7b4fb6a85fe29df2.1531953780.git.osandov@fb.com Signed-off-by: Omar Sandoval <[email protected]> Reviewed-by: Andrew Morton <[email protected]> Cc: Alexey Dobriyan <[email protected]> Cc: Bhupesh Sharma <[email protected]> Cc: Eric Biederman <[email protected]> Cc: James Morse <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
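As a rough userspace analogy of the xchg() approach described above, the sketch below uses C11 atomic_exchange() in place of the kernel's xchg(). The names `need_update`, `hotplug_notify()`, `update_if_needed()` and `rebuild_count` are hypothetical and only illustrate the pattern: the notifier sets the flag without any lock, and the updater tests and clears it in one atomic step.

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical flag: set by the notifier, consumed by the updater. */
static atomic_int need_update = 1;

/* Counts how many times the (imaginary) list rebuild ran. */
static int rebuild_count;

/* Notifier side: no lock needed, just publish that an update is due. */
static void hotplug_notify(void)
{
	atomic_store(&need_update, 1);
}

/*
 * Updater side: atomically read the old flag value and clear it.
 * A notification that arrives after the exchange sets the flag again
 * and is picked up by the next call, so it cannot be lost.
 * Returns 1 if a rebuild was performed, 0 otherwise.
 */
static int update_if_needed(void)
{
	if (!atomic_exchange(&need_update, 0))
		return 0;
	rebuild_count++;	/* stand-in for rebuilding the list */
	return 1;
}
```

The point of the exchange is that testing and clearing the flag cannot be separated by a concurrent store, which is what the lock previously guaranteed.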
2018-08-22proc/kcore: don't grab lock for kclist_add()Omar Sandoval2-5/+4
Patch series "/proc/kcore improvements", v4. This series makes a few improvements to /proc/kcore. It fixes a couple of small issues in v3 but is otherwise the same. Patches 1, 2, and 3 are prep patches. Patch 4 is a fix/cleanup. Patch 5 is another prep patch. Patches 6 and 7 are optimizations to ->read(). Patch 8 makes it possible to enable CRASH_CORE on any architecture, which is needed for patch 9. Patch 9 adds vmcoreinfo to /proc/kcore. This patch (of 9): kclist_add() is only called at init time, so there's no point in grabbing any locks. We're also going to replace the rwlock with a rwsem, which we don't want to try grabbing during early boot. While we're here, mark kclist_add() with __init so that we'll get a warning if it's called from non-init code. Link: http://lkml.kernel.org/r/98208db1faf167aa8b08eebfa968d95c70527739.1531953780.git.osandov@fb.com Signed-off-by: Omar Sandoval <[email protected]> Reviewed-by: Andrew Morton <[email protected]> Reviewed-by: Bhupesh Sharma <[email protected]> Tested-by: Bhupesh Sharma <[email protected]> Cc: Alexey Dobriyan <[email protected]> Cc: Bhupesh Sharma <[email protected]> Cc: Eric Biederman <[email protected]> Cc: James Morse <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2018-08-22fs/proc/kcore.c: use __pa_symbol() for KCORE_TEXT list entriesJames Morse1-1/+3
elf_kcore_store_hdr() uses __pa() to find the physical address of KCORE_RAM or KCORE_TEXT entries exported as program headers. This trips CONFIG_DEBUG_VIRTUAL's checks, as the KCORE_TEXT entries are not in the linear map. Handle these two cases separately, using __pa_symbol() for the KCORE_TEXT entries. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: James Morse <[email protected]> Cc: Alexey Dobriyan <[email protected]> Cc: Omar Sandoval <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2018-08-22fs/proc/vmcore.c: use new typedef vm_fault_tSouptick Joarder1-1/+1
Use new return type vm_fault_t for fault handler in struct vm_operations_struct. For now, this is just documenting that the function returns a VM_FAULT value rather than an errno. Once all instances are converted, vm_fault_t will become a distinct type. See 1c8f422059ae ("mm: change return type to vm_fault_t") for reference. Link: http://lkml.kernel.org/r/20180702153325.GA3875@jordon-HP-15-Notebook-PC Signed-off-by: Souptick Joarder <[email protected]> Reviewed-by: Matthew Wilcox <[email protected]> Cc: Ganesh Goudar <[email protected]> Cc: Rahul Lakkireddy <[email protected]> Cc: David S. Miller <[email protected]> Cc: Alexey Dobriyan <[email protected]> Cc: Matthew Wilcox <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2018-08-22proc: use "unsigned int" in /proc/stat hookAlexey Dobriyan1-1/+1
The number of CPUs is never high enough to force 64-bit arithmetic. Save a couple of bytes on x86_64. Link: http://lkml.kernel.org/r/20180627200710.GC18434@avx2 Signed-off-by: Alexey Dobriyan <[email protected]> Reviewed-by: Andrew Morton <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2018-08-22proc: spread "const" a bitAlexey Dobriyan2-3/+3
Link: http://lkml.kernel.org/r/20180627200614.GB18434@avx2 Signed-off-by: Alexey Dobriyan <[email protected]> Reviewed-by: Andrew Morton <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2018-08-22proc: use macro in /proc/latency hookAlexey Dobriyan1-1/+1
->latency_record is defined as struct latency_record[LT_SAVECOUNT]; so use the same macro while iterating. Link: http://lkml.kernel.org/r/20180627200534.GA18434@avx2 Signed-off-by: Alexey Dobriyan <[email protected]> Reviewed-by: Andrew Morton <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2018-08-22proc: save 2 atomic ops on write to "/proc/*/attr/*"Alexey Dobriyan1-19/+19
The code checks whether the write is done by current to its own attributes. A get/put pair is unnecessary for that, as the check can be done under RCU. Note: rcu_read_unlock() can be done even earlier since the pointer to the task is not dereferenced. It depends on whether /proc code should look scary or not:

    rcu_read_lock();
    task = pid_task(...);
    rcu_read_unlock();
    if (!task)
        return -ESRCH;
    if (task != current)
        return -EACCES;

P.S.: rename "length" variable. Code like this

    length = -EINVAL;

should not exist. Link: http://lkml.kernel.org/r/20180627200218.GF18113@avx2 Signed-off-by: Alexey Dobriyan <[email protected]> Reviewed-by: Andrew Morton <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2018-08-22proc: put task earlier in /proc/*/fail-nthAlexey Dobriyan1-3/+1
Link: http://lkml.kernel.org/r/20180627195427.GE18113@avx2 Signed-off-by: Alexey Dobriyan <[email protected]> Reviewed-by: Andrew Morton <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2018-08-22proc: smaller readlock section in readdir("/proc")Alexey Dobriyan1-2/+2
Readdir context is thread local, so ->pos is thread local, move it out of readlock. Link: http://lkml.kernel.org/r/20180627195339.GD18113@avx2 Signed-off-by: Alexey Dobriyan <[email protected]> Reviewed-by: Andrew Morton <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2018-08-22proc: test /proc/thread-self symlinkAlexey Dobriyan4-0/+71
Same story: I have WIP patch to make it faster, so better have a test as well. Link: http://lkml.kernel.org/r/20180627195209.GC18113@avx2 Signed-off-by: Alexey Dobriyan <[email protected]> Cc: Shuah Khan <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2018-08-22proc: test /proc/self symlinkAlexey Dobriyan4-0/+49
There are plans to change how /proc/self result is calculated, for that a test is necessary. Use direct system call because of this whole getpid caching story. Link: http://lkml.kernel.org/r/20180627195103.GB18113@avx2 Signed-off-by: Alexey Dobriyan <[email protected]> Cc: Shuah Khan <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2018-08-22fs/proc/uptime.c: use ktime_get_boottime_ts64Arnd Bergmann1-2/+2
get_monotonic_boottime() is deprecated and uses the old timespec type. Let's convert /proc/uptime to use ktime_get_boottime_ts64(). Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Arnd Bergmann <[email protected]> Acked-by: Thomas Gleixner <[email protected]> Cc: Al Viro <[email protected]> Cc: Deepa Dinamani <[email protected]> Cc: Alexey Dobriyan <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2018-08-22proc: fixup PDE allocation bloatAlexey Dobriyan2-12/+11
24074a35c5c975 ("proc: Make inline name size calculation automatic") started to put PDE allocations into kmalloc-256 which is unnecessary as ~40 character names are very rare. Put allocation back into kmalloc-192 cache for 64-bit non-debug builds. Put BUILD_BUG_ON to know when PDE size has gotten out of control. [[email protected]: fix BUILD_BUG_ON breakage on powerpc64] Link: http://lkml.kernel.org/r/20180703191602.GA25521@avx2 Link: http://lkml.kernel.org/r/20180617215732.GA24688@avx2 Signed-off-by: Alexey Dobriyan <[email protected]> Cc: David Howells <[email protected]> Cc: Al Viro <[email protected]> Cc: Stephen Rothwell <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2018-08-22mm: fix comment for NODEMASK_ALLOCOscar Salvador1-1/+1
Currently, NODEMASK_ALLOC allocates a nodemask_t with kmalloc when NODES_SHIFT is higher than 8, otherwise it declares it on the stack. The comment says that the reasoning behind this is that nodemask_t will be 256 bytes when NODES_SHIFT is higher than 8, but this is not true. For example, NODES_SHIFT = 9 will give us a 64-byte nodemask_t. Let us fix up the comment for that. Another thing is that it might make sense to let values lower than 128 bytes be allocated on the stack. Although this all depends on the depth of the stack (and this changes from function to function), I think that 64 bytes is something we can easily afford. So we could even bump the limit by 1 (from > 8 to > 9). Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Oscar Salvador <[email protected]> Reviewed-by: Andrew Morton <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
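The size claim above can be checked with a little arithmetic. The sketch below assumes that a nodemask_t is a bitmap with one bit per possible node, i.e. 1 << NODES_SHIFT bits rounded up to whole unsigned longs; `nodemask_bytes()` is a hypothetical helper, not a kernel function.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

/*
 * Bytes needed for a bitmap of (1 << nodes_shift) bits, rounded up to a
 * whole number of unsigned longs (the assumed layout of nodemask_t).
 */
static size_t nodemask_bytes(unsigned int nodes_shift)
{
	size_t bits_per_long = sizeof(unsigned long) * CHAR_BIT;
	size_t bits, longs;

	assert(nodes_shift < sizeof(size_t) * CHAR_BIT);
	bits = (size_t)1 << nodes_shift;
	longs = (bits + bits_per_long - 1) / bits_per_long;
	return longs * sizeof(unsigned long);
}
```

Under that assumption NODES_SHIFT = 9 gives 512 bits, or 64 bytes, and 256 bytes is only reached at NODES_SHIFT = 11.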
2018-08-22drivers/block/zram/zram_drv.c: fix bug storing backing_devPeter Kalauskas1-1/+6
The call to strlcpy in backing_dev_store is incorrect. It should take the size of the destination buffer instead of the size of the source buffer. Additionally, ignore the newline character (\n) when reading the new file_name buffer. This makes it possible to set the backing_dev as follows: echo /dev/sdX > /sys/block/zram0/backing_dev The reason it worked before was the fact that strlcpy() copies 'len - 1' bytes, which is strlen(buf) - 1 in our case, so it accidentally didn't copy the trailing new line symbol. Which also means that "echo -n /dev/sdX" most likely was broken. Signed-off-by: Peter Kalauskas <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Acked-by: Minchan Kim <[email protected]> Reviewed-by: Sergey Senozhatsky <[email protected]> Cc: <[email protected]> [4.14+] Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
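The two points of the fix above (bound the copy by the destination size, then drop a trailing newline) can be sketched in userspace C. This is an illustration under stated assumptions, not the driver code: `store_name()` is a hypothetical helper, and snprintf() stands in for strlcpy(), which is not available in every C library.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Copy buf into dst, never writing more than dst_size bytes (including
 * the terminating NUL), then strip one trailing newline if present.
 * Returns the length of the stored string.
 */
static size_t store_name(char *dst, size_t dst_size, const char *buf)
{
	size_t len;

	assert(dst_size > 0);
	/* The bound is the size of the destination, not of the source. */
	snprintf(dst, dst_size, "%s", buf);

	len = strlen(dst);
	if (len > 0 && dst[len - 1] == '\n')
		dst[--len] = '\0';
	return len;
}
```

With this shape, input written by "echo" (with a newline) and by "echo -n" (without one) ends up as the same stored name.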
2018-08-22/proc/meminfo: add percpu populated pages countDennis Zhou (Facebook)4-0/+36
Currently, percpu memory only exposes allocation and utilization information via debugfs. This more or less is only really useful for understanding the fragmentation and allocation information at a per-chunk level with a few global counters. This is also gated behind a config. BPF and cgroup, for example, have seen an increase in use causing increased use of percpu memory. Let's make it easier for someone to identify how much memory is being used. This patch adds the "Percpu" stat to meminfo to more easily look up how much percpu memory is in use. This number includes the cost for all allocated backing pages and not just insight at the per-unit, per-chunk level. Metadata is excluded. I think excluding metadata is fair because the backing memory scales with the number of cpus and can quickly outweigh the metadata. It also makes this calculation light. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Dennis Zhou <[email protected]> Acked-by: Tejun Heo <[email protected]> Acked-by: Roman Gushchin <[email protected]> Reviewed-by: Andrew Morton <[email protected]> Acked-by: David Rientjes <[email protected]> Acked-by: Vlastimil Babka <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Christoph Lameter <[email protected]> Cc: Alexey Dobriyan <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2018-08-22mm, oom: introduce memory.oom.groupRoman Gushchin4-0/+159
For some workloads an intervention from the OOM killer can be painful. Killing a random task can bring the workload into an inconsistent state. Historically, there are two common solutions for this problem: 1) enabling panic_on_oom, 2) using a userspace daemon to monitor OOMs and kill all outstanding processes. Both approaches have their downsides: rebooting on each OOM is an obvious waste of capacity, and handling everything in userspace is tricky and requires a userspace agent, which will monitor all cgroups for OOMs. In most cases an in-kernel after-OOM cleaning-up mechanism can eliminate the necessity of enabling panic_on_oom. Also, it can simplify the cgroup management for userspace applications. This commit introduces a new knob for cgroup v2 memory controller: memory.oom.group. The knob determines whether the cgroup should be treated as an indivisible workload by the OOM killer. If set, all tasks belonging to the cgroup or to its descendants (if the memory cgroup is not a leaf cgroup) are killed together or not at all. To determine which cgroup has to be killed, we traverse the cgroup hierarchy from the victim task's cgroup up to the OOMing cgroup (or root), looking for the highest-level cgroup with memory.oom.group set. Tasks with the OOM protection (oom_score_adj set to -1000) are treated as an exception and are never killed. This patch doesn't change the OOM victim selection algorithm. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Roman Gushchin <[email protected]> Acked-by: Michal Hocko <[email protected]> Acked-by: Johannes Weiner <[email protected]> Cc: David Rientjes <[email protected]> Cc: Tetsuo Handa <[email protected]> Cc: Tejun Heo <[email protected]> Cc: Vladimir Davydov <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
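The hierarchy walk described above can be sketched as follows. This is a simplified userspace model, not the kernel implementation: `struct memcg` and `oom_group_target()` are hypothetical, each cgroup only knows its parent and its memory.oom.group setting, and details such as special handling of the root cgroup or of protected tasks are left out.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model of a memory cgroup. */
struct memcg {
	struct memcg *parent;	/* NULL for the root */
	int oom_group;		/* value of memory.oom.group */
};

/*
 * Walk from the victim's cgroup up to oom_domain (the OOMing cgroup), or
 * up to the root if oom_domain is NULL, and return the highest-level
 * cgroup on that path with oom_group set. Returns NULL if none is set,
 * in which case only the victim task itself would be killed.
 */
static struct memcg *oom_group_target(struct memcg *victim,
				      struct memcg *oom_domain)
{
	struct memcg *cg, *target = NULL;

	for (cg = victim; cg; cg = cg->parent) {
		if (cg->oom_group)
			target = cg;	/* later hits are higher up */
		if (cg == oom_domain)
			break;		/* never look above the OOMing cgroup */
	}
	return target;
}
```

Stopping at the OOMing cgroup matters: an ancestor above it may have the knob set, but it is not the one that ran out of memory, so its other children are left alone.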
2018-08-22mm, oom: refactor oom_kill_process()Roman Gushchin1-58/+65
Patch series "introduce memory.oom.group", v2. This is a tiny implementation of cgroup-aware OOM killer, which adds an ability to kill a cgroup as a single unit and so guarantee the integrity of the workload. Although it has only a limited functionality in comparison to what now resides in the mm tree (it doesn't change the victim task selection algorithm, doesn't look at memory stats on cgroup level, etc), it's also much simpler and more straightforward. So, hopefully, we can avoid having long debates here, as we had with the full implementation. As it doesn't prevent any further development, and implements a useful and complete feature, it looks like a sane way forward. This patch (of 2): oom_kill_process() consists of two logical parts: the first one is responsible for considering task's children as a potential victim and printing the debug information. The second half is responsible for sending SIGKILL to all tasks sharing the mm struct with the given victim. This commit splits oom_kill_process() with an intention to re-use the second half: __oom_kill_process(). The cgroup-aware OOM killer will kill multiple tasks belonging to the victim cgroup. We don't need to print the debug information for each task, as well as play with task selection (considering task's children), so we can't use the existing oom_kill_process(). Link: http://lkml.kernel.org/r/[email protected] Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Roman Gushchin <[email protected]> Acked-by: Michal Hocko <[email protected]> Acked-by: Johannes Weiner <[email protected]> Acked-by: David Rientjes <[email protected]> Cc: Vladimir Davydov <[email protected]> Cc: Tetsuo Handa <[email protected]> Cc: David Rientjes <[email protected]> Cc: Tejun Heo <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2018-08-22tools/testing/selftests/vm/: add MAP_POPULATE testDmitry Safonov4-0/+126
As with many other projects, we use some shmalloc allocator. At some point we need to make a part of allocated pages back private to process. And it should be populated straight away. Check that (MAP_PRIVATE | MAP_POPULATE) actually copies the private page. [[email protected]: change message, per review discussion] Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Dmitry Safonov <[email protected]> Reviewed-by: Andrew Morton <[email protected]> Cc: Dmitry Safonov <[email protected]> Cc: Hua Zhong <[email protected]> Cc: Shuah Khan <[email protected]> Cc: Stuart Ritchie <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2018-08-22mm/page_alloc: Introduce free_area_init_core_hotplugOscar Salvador4-36/+61
Currently, whenever a new node is created/re-used from the memhotplug path, we call free_area_init_node()->free_area_init_core(). But there is some code that we do not really need to run when we are coming from such path. free_area_init_core() performs the following actions:

  1) Initializes pgdat internals, such as spinlock, waitqueues and more.
  2) Accounts # nr_all_pages and # nr_kernel_pages. These values are used later on when creating hash tables.
  3) Accounts the number of managed_pages per zone, subtracting dma_reserved and memmap pages.
  4) Initializes some fields of the zone structure data.
  5) Calls init_currently_empty_zone to initialize all the freelists.
  6) Calls memmap_init to initialize all pages belonging to a certain zone.

When called from the memhotplug path, free_area_init_core() only needs to perform actions #1 and #4. Action #2 is pointless as the zones do not have any pages since either the node was freed, or we are re-using it; either way, all zones belonging to this node should have 0 pages. For the same reason, action #3 always results in managed_pages being 0. Actions #5 and #6 are performed later on when onlining the pages:

  online_pages()->move_pfn_range_to_zone()->init_currently_empty_zone()
  online_pages()->move_pfn_range_to_zone()->memmap_init_zone()

This patch does two things: First, it moves the node/zone initialization to their own functions, which allows us to create a small version of free_area_init_core, where we only perform:

  1) Initialization of pgdat internals, such as spinlock, waitqueues and more
  4) Initialization of some fields of the zone structure data

These two functions are: pgdat_init_internals() and zone_init_internals(). The second thing this patch does is to introduce free_area_init_core_hotplug(), the memhotplug version of free_area_init_core(): Currently, we call free_area_init_node() from the memhotplug path. In there, we set some pgdat's fields, and call calculate_node_totalpages().
calculate_node_totalpages() calculates the # of pages the node has. Since the node is either new, or we are re-using it, the zones belonging to this node should not have any pages, so there is no point to calculate this now. Actually, we re-set these values to 0 later on with the calls to: reset_node_managed_pages() reset_node_present_pages() The # of pages per node and the # of pages per zone will be calculated when onlining the pages: online_pages()->move_pfn_range()->move_pfn_range_to_zone()->resize_zone_range() online_pages()->move_pfn_range()->move_pfn_range_to_zone()->resize_pgdat_range() Also, since free_area_init_core/free_area_init_node will now only get called during early init, let us replace __paginginit with __init, so their code gets freed up. [[email protected]: fix section usage] Link: http://lkml.kernel.org/r/[email protected] [[email protected]: v6] Link: http://lkml.kernel.org/r/[email protected] Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Oscar Salvador <[email protected]> Reviewed-by: Pavel Tatashin <[email protected]> Acked-by: Michal Hocko <[email protected]> Acked-by: Vlastimil Babka <[email protected]> Cc: Pasha Tatashin <[email protected]> Cc: Aaron Lu <[email protected]> Cc: Dan Williams <[email protected]> Cc: David Hildenbrand <[email protected]> Cc: Joonsoo Kim <[email protected]> Cc: Mel Gorman <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>