Age | Commit message | Author | Files | Lines
2019-05-14 | kernel/latencytop.c: remove unnecessary checks for latencytop_enabled | Lin Feng | 1 | -6/+0
1. In the latencytop source code we have only this calling chain:

    account_scheduler_latency(struct task_struct *task, int usecs, int inter)
    {
            if (unlikely(latencytop_enabled))       /* the outermost check */
                    __account_scheduler_latency(task, usecs, inter);
    }
    __account_scheduler_latency
            account_global_scheduler_latency
                    if (!latencytop_enabled)

So the inner check for latencytop_enabled is not necessary at all.

2. In clear_all_latency_tracing (now called clear_tsk_latency_tracing) the check for latencytop_enabled is redundant and buggy to some extent: there is no reason to refuse clearing /proc/$pid/latency just because latencytop_enabled is set to 0. If we stop latencytop manually with echo 0 > /proc/sys/kernel/latencytop and then try to clear /proc/$pid/latency, the clear fails. The sibling function clear_global_latency_tracing has no such check either.

Notes: These changes are only visible to users who set CONFIG_LATENCYTOP and won't change the behavior of the user-space latencytop tool.

Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Lin Feng <[email protected]> Cc: Alexey Dobriyan <[email protected]> Cc: Fabian Frederick <[email protected]> Cc: Arjan van de Ven <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2019-05-14 | kernel/notifier.c: double register detection | Vasily Averin | 1 | -0/+1
By design, notifiers can be registered only once; a second register attempt, called by mistake, silently corrupts the notifier list. A few years ago I investigated this problem after a host was power cycled because of notifier list corruption. I prepared this patch, applied it to the OpenVZ kernel, and sent it upstream, but nobody commented on it. Later it helped us detect a similar problem in the OpenVZ kernel. Mistakes with notifier registration can happen, for example, during subsystem initialization from different namespaces, or because of a lost unregister in the roll-back path on initialization failures. The proposed check cannot prevent the described problem, but it allows us to detect its cause quickly without coredump analysis. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Vasily Averin <[email protected]> Reviewed-by: Andrew Morton <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
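The patch is a single added line; conceptually, the registration walk gains a sanity check along these lines (a hedged sketch of notifier_chain_register() in kernel/notifier.c, not the verbatim diff):

    static int notifier_chain_register(struct notifier_block **nl,
                    struct notifier_block *n)
    {
            while ((*nl) != NULL) {
                    /* Catch the same block being registered twice. */
                    if (unlikely((*nl) == n)) {
                            WARN(1, "double register detected");
                            return 0;
                    }
                    if (n->priority > (*nl)->priority)
                            break;
                    nl = &((*nl)->next);
            }
            n->next = *nl;
            rcu_assign_pointer(*nl, n);
            return 0;
    }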
2019-05-14 | compiler: allow all arches to enable CONFIG_OPTIMIZE_INLINING | Masahiro Yamada | 4 | -19/+15
Commit 60a3cdd06394 ("x86: add optimized inlining") introduced CONFIG_OPTIMIZE_INLINING, but it has been available only for x86. The idea is obviously arch-agnostic. This commit moves the config entry from arch/x86/Kconfig.debug to lib/Kconfig.debug so that all architectures can benefit from it. This can make a huge difference in kernel image size, especially when CONFIG_OPTIMIZE_FOR_SIZE is enabled. For example, I got a 3.5% smaller arm64 kernel for v5.1-rc1:

         dec    file
    18983424    arch/arm64/boot/Image.before
    18321920    arch/arm64/boot/Image.after

This also slightly improves the "Kernel hacking" Kconfig menu, as e61aca5158a8 ("Merge branch 'kconfig-diet' from Dave Hansen") suggested; this config option is a good fit in the "compiler option" menu. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Masahiro Yamada <[email protected]> Acked-by: Borislav Petkov <[email protected]> Cc: Arnd Bergmann <[email protected]> Cc: Benjamin Herrenschmidt <[email protected]> Cc: Boris Brezillon <[email protected]> Cc: Brian Norris <[email protected]> Cc: Christophe Leroy <[email protected]> Cc: David Woodhouse <[email protected]> Cc: Heiko Carstens <[email protected]> Cc: "H. Peter Anvin" <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Marek Vasut <[email protected]> Cc: Mark Rutland <[email protected]> Cc: Mathieu Malaterre <[email protected]> Cc: Miquel Raynal <[email protected]> Cc: Paul Mackerras <[email protected]> Cc: Ralf Baechle <[email protected]> Cc: Richard Weinberger <[email protected]> Cc: Russell King <[email protected]> Cc: Stefan Agner <[email protected]> Cc: Thomas Gleixner <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
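For context, the option controls whether 'inline' is forced on the compiler; include/linux/compiler_types.h does roughly the following (a simplified sketch, not the exact kernel macro definitions):

    /* With OPTIMIZE_INLINING disabled, "inline" means always inline;
     * enabling it leaves the final inlining decision to the compiler. */
    #if !defined(CONFIG_OPTIMIZE_INLINING)
    #define inline  inline __attribute__((__always_inline__)) __gnu_inline notrace
    #else
    #define inline  inline __gnu_inline notrace
    #endif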
2019-05-14 | powerpc/mm/radix: mark __tlbie_pid() and friends as __always_inline | Masahiro Yamada | 1 | -4/+4
This prepares to move CONFIG_OPTIMIZE_INLINING from x86 to a common place. We need to eliminate potential issues beforehand. If it is enabled for powerpc, the following errors are reported:

    arch/powerpc/mm/tlb-radix.c: In function '__tlbie_lpid':
    arch/powerpc/mm/tlb-radix.c:148:2: warning: asm operand 3 probably doesn't match constraints
      asm volatile(PPC_TLBIE_5(%0, %4, %3, %2, %1)
      ^~~
    arch/powerpc/mm/tlb-radix.c:148:2: error: impossible constraint in 'asm'
    arch/powerpc/mm/tlb-radix.c: In function '__tlbie_pid':
    arch/powerpc/mm/tlb-radix.c:118:2: warning: asm operand 3 probably doesn't match constraints
      asm volatile(PPC_TLBIE_5(%0, %4, %3, %2, %1)
      ^~~
    arch/powerpc/mm/tlb-radix.c: In function '__tlbiel_pid':
    arch/powerpc/mm/tlb-radix.c:104:2: warning: asm operand 3 probably doesn't match constraints
      asm volatile(PPC_TLBIEL(%0, %4, %3, %2, %1)
      ^~~

Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Masahiro Yamada <[email protected]> Cc: Arnd Bergmann <[email protected]> Cc: Benjamin Herrenschmidt <[email protected]> Cc: Boris Brezillon <[email protected]> Cc: Borislav Petkov <[email protected]> Cc: Brian Norris <[email protected]> Cc: Christophe Leroy <[email protected]> Cc: David Woodhouse <[email protected]> Cc: Heiko Carstens <[email protected]> Cc: "H. Peter Anvin" <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Marek Vasut <[email protected]> Cc: Mark Rutland <[email protected]> Cc: Mathieu Malaterre <[email protected]> Cc: Miquel Raynal <[email protected]> Cc: Paul Mackerras <[email protected]> Cc: Ralf Baechle <[email protected]> Cc: Richard Weinberger <[email protected]> Cc: Russell King <[email protected]> Cc: Stefan Agner <[email protected]> Cc: Thomas Gleixner <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
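The fix pattern for this whole preparation series is mechanical; for example (a hedged sketch, the real parameter lists differ):

    /* The asm constraints rely on constant-folded arguments, which only
     * survives if the function is really inlined. */
    -static inline void __tlbie_pid(unsigned long pid, unsigned long ric)
    +static __always_inline void __tlbie_pid(unsigned long pid, unsigned long ric)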
2019-05-14 | powerpc/mm/radix: mark __radix__flush_tlb_range_psize() as __always_inline | Masahiro Yamada | 1 | -1/+1
This prepares to move CONFIG_OPTIMIZE_INLINING from x86 to a common place. We need to eliminate potential issues beforehand. If it is enabled for powerpc, the following error is reported:

    arch/powerpc/mm/tlb-radix.c: In function '__radix__flush_tlb_range_psize':
    arch/powerpc/mm/tlb-radix.c:104:2: error: asm operand 3 probably doesn't match constraints [-Werror]
      asm volatile(PPC_TLBIEL(%0, %4, %3, %2, %1)
      ^~~
    arch/powerpc/mm/tlb-radix.c:104:2: error: impossible constraint in 'asm'

Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Masahiro Yamada <[email protected]> Cc: Arnd Bergmann <[email protected]> Cc: Benjamin Herrenschmidt <[email protected]> Cc: Boris Brezillon <[email protected]> Cc: Borislav Petkov <[email protected]> Cc: Brian Norris <[email protected]> Cc: Christophe Leroy <[email protected]> Cc: David Woodhouse <[email protected]> Cc: Heiko Carstens <[email protected]> Cc: "H. Peter Anvin" <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Marek Vasut <[email protected]> Cc: Mark Rutland <[email protected]> Cc: Mathieu Malaterre <[email protected]> Cc: Miquel Raynal <[email protected]> Cc: Paul Mackerras <[email protected]> Cc: Ralf Baechle <[email protected]> Cc: Richard Weinberger <[email protected]> Cc: Russell King <[email protected]> Cc: Stefan Agner <[email protected]> Cc: Thomas Gleixner <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2019-05-14 | powerpc/prom_init: mark prom_getprop() and prom_getproplen() as __init | Masahiro Yamada | 1 | -3/+3
This prepares to move CONFIG_OPTIMIZE_INLINING from x86 to a common place. We need to eliminate potential issues beforehand. If it is enabled for powerpc, the following modpost warnings are reported:

    WARNING: vmlinux.o(.text.unlikely+0x20): Section mismatch in reference from the function .prom_getprop() to the function .init.text:.call_prom()
    The function .prom_getprop() references the function __init .call_prom().
    This is often because .prom_getprop lacks a __init annotation or the annotation of .call_prom is wrong.

    WARNING: vmlinux.o(.text.unlikely+0x3c): Section mismatch in reference from the function .prom_getproplen() to the function .init.text:.call_prom()
    The function .prom_getproplen() references the function __init .call_prom().
    This is often because .prom_getproplen lacks a __init annotation or the annotation of .call_prom is wrong.

Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Masahiro Yamada <[email protected]> Cc: Arnd Bergmann <[email protected]> Cc: Benjamin Herrenschmidt <[email protected]> Cc: Boris Brezillon <[email protected]> Cc: Borislav Petkov <[email protected]> Cc: Brian Norris <[email protected]> Cc: Christophe Leroy <[email protected]> Cc: David Woodhouse <[email protected]> Cc: Heiko Carstens <[email protected]> Cc: "H. Peter Anvin" <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Marek Vasut <[email protected]> Cc: Mark Rutland <[email protected]> Cc: Mathieu Malaterre <[email protected]> Cc: Miquel Raynal <[email protected]> Cc: Paul Mackerras <[email protected]> Cc: Ralf Baechle <[email protected]> Cc: Richard Weinberger <[email protected]> Cc: Russell King <[email protected]> Cc: Stefan Agner <[email protected]> Cc: Thomas Gleixner <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2019-05-14 | ARM: mark setup_machine_tags() stub as __init __noreturn | Masahiro Yamada | 1 | -1/+1
This prepares to move CONFIG_OPTIMIZE_INLINING from x86 to a common place. We need to eliminate potential issues beforehand. If it is enabled for arm, a Clang build results in the following modpost warning:

    WARNING: vmlinux.o(.text+0x1124): Section mismatch in reference from the function setup_machine_tags() to the function .init.text:early_print()
    The function setup_machine_tags() references the function __init early_print().
    This is often because setup_machine_tags lacks a __init annotation or the annotation of early_print is wrong.

Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Masahiro Yamada <[email protected]> Cc: Arnd Bergmann <[email protected]> Cc: Benjamin Herrenschmidt <[email protected]> Cc: Boris Brezillon <[email protected]> Cc: Borislav Petkov <[email protected]> Cc: Brian Norris <[email protected]> Cc: Christophe Leroy <[email protected]> Cc: David Woodhouse <[email protected]> Cc: Heiko Carstens <[email protected]> Cc: "H. Peter Anvin" <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Marek Vasut <[email protected]> Cc: Mark Rutland <[email protected]> Cc: Mathieu Malaterre <[email protected]> Cc: Miquel Raynal <[email protected]> Cc: Paul Mackerras <[email protected]> Cc: Ralf Baechle <[email protected]> Cc: Richard Weinberger <[email protected]> Cc: Russell King <[email protected]> Cc: Stefan Agner <[email protected]> Cc: Thomas Gleixner <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2019-05-14 | MIPS: mark __fls() and __ffs() as __always_inline | Masahiro Yamada | 1 | -2/+2
This prepares to move CONFIG_OPTIMIZE_INLINING from x86 to a common place. We need to eliminate potential issues beforehand. If it is enabled for mips, the following errors are reported:

    arch/mips/mm/sc-mips.o: In function `mips_sc_prefetch_enable.part.2':
    sc-mips.c:(.text+0x98): undefined reference to `mips_gcr_base'
    sc-mips.c:(.text+0x9c): undefined reference to `mips_gcr_base'
    sc-mips.c:(.text+0xbc): undefined reference to `mips_gcr_base'
    sc-mips.c:(.text+0xc8): undefined reference to `mips_gcr_base'
    sc-mips.c:(.text+0xdc): undefined reference to `mips_gcr_base'
    arch/mips/mm/sc-mips.o:sc-mips.c:(.text.unlikely+0x44): more undefined references to `mips_gcr_base'

Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Masahiro Yamada <[email protected]> Cc: Arnd Bergmann <[email protected]> Cc: Benjamin Herrenschmidt <[email protected]> Cc: Boris Brezillon <[email protected]> Cc: Borislav Petkov <[email protected]> Cc: Brian Norris <[email protected]> Cc: Christophe Leroy <[email protected]> Cc: David Woodhouse <[email protected]> Cc: Heiko Carstens <[email protected]> Cc: "H. Peter Anvin" <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Marek Vasut <[email protected]> Cc: Mark Rutland <[email protected]> Cc: Mathieu Malaterre <[email protected]> Cc: Miquel Raynal <[email protected]> Cc: Paul Mackerras <[email protected]> Cc: Ralf Baechle <[email protected]> Cc: Richard Weinberger <[email protected]> Cc: Russell King <[email protected]> Cc: Stefan Agner <[email protected]> Cc: Thomas Gleixner <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2019-05-14 | mtd: rawnand: vf610_nfc: add initializer to avoid -Wmaybe-uninitialized | Masahiro Yamada | 1 | -1/+1
This prepares to move CONFIG_OPTIMIZE_INLINING from x86 to a common place. We need to eliminate potential issues beforehand. The kbuild test robot has never reported a -Wmaybe-uninitialized warning for this, probably because vf610_nfc_run() is inlined by the x86 compiler's inlining heuristic. If CONFIG_OPTIMIZE_INLINING is enabled for a different architecture and vf610_nfc_run() is not inlined, the following warning is reported:

    drivers/mtd/nand/raw/vf610_nfc.c: In function `vf610_nfc_cmd':
    drivers/mtd/nand/raw/vf610_nfc.c:455:3: warning: `offset' may be used uninitialized in this function [-Wmaybe-uninitialized]
       vf610_nfc_rd_from_sram(instr->ctx.data.buf.in + offset,
       ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
          nfc->regs + NFC_MAIN_AREA(0) + offset,
          ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
          trfr_sz, !nfc->data_access);
          ~~~~~~~~~~~~~~~~~~~~~~~~~~~

Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Masahiro Yamada <[email protected]> Cc: Arnd Bergmann <[email protected]> Cc: Benjamin Herrenschmidt <[email protected]> Cc: Boris Brezillon <[email protected]> Cc: Borislav Petkov <[email protected]> Cc: Brian Norris <[email protected]> Cc: Christophe Leroy <[email protected]> Cc: David Woodhouse <[email protected]> Cc: Heiko Carstens <[email protected]> Cc: "H. Peter Anvin" <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Marek Vasut <[email protected]> Cc: Mark Rutland <[email protected]> Cc: Mathieu Malaterre <[email protected]> Cc: Miquel Raynal <[email protected]> Cc: Paul Mackerras <[email protected]> Cc: Ralf Baechle <[email protected]> Cc: Richard Weinberger <[email protected]> Cc: Russell King <[email protected]> Cc: Stefan Agner <[email protected]> Cc: Thomas Gleixner <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2019-05-14 | s390/cpacf: mark cpacf_query() as __always_inline | Masahiro Yamada | 1 | -1/+1
This prepares to move CONFIG_OPTIMIZE_INLINING from x86 to a common place. We need to eliminate potential issues beforehand. If it is enabled for s390, the following error is reported:

    In file included from arch/s390/crypto/des_s390.c:19:
    arch/s390/include/asm/cpacf.h: In function 'cpacf_query':
    arch/s390/include/asm/cpacf.h:170:2: warning: asm operand 3 probably doesn't match constraints
      asm volatile(
      ^~~
    arch/s390/include/asm/cpacf.h:170:2: error: impossible constraint in 'asm'

Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Masahiro Yamada <[email protected]> Cc: Arnd Bergmann <[email protected]> Cc: Benjamin Herrenschmidt <[email protected]> Cc: Boris Brezillon <[email protected]> Cc: Borislav Petkov <[email protected]> Cc: Brian Norris <[email protected]> Cc: Christophe Leroy <[email protected]> Cc: David Woodhouse <[email protected]> Cc: Heiko Carstens <[email protected]> Cc: "H. Peter Anvin" <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Marek Vasut <[email protected]> Cc: Mark Rutland <[email protected]> Cc: Mathieu Malaterre <[email protected]> Cc: Miquel Raynal <[email protected]> Cc: Paul Mackerras <[email protected]> Cc: Ralf Baechle <[email protected]> Cc: Richard Weinberger <[email protected]> Cc: Russell King <[email protected]> Cc: Stefan Agner <[email protected]> Cc: Thomas Gleixner <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2019-05-14 | MIPS: mark mult_sh_align_mod() as __always_inline | Masahiro Yamada | 1 | -2/+2
This prepares to move CONFIG_OPTIMIZE_INLINING from x86 to a common place. We need to eliminate potential issues beforehand. If it is enabled for mips, the following error is reported:

    arch/mips/kernel/cpu-bugs64.c: In function 'mult_sh_align_mod.constprop':
    arch/mips/kernel/cpu-bugs64.c:33:2: error: asm operand 1 probably doesn't match constraints [-Werror]
      asm volatile(
      ^~~
    arch/mips/kernel/cpu-bugs64.c:33:2: error: asm operand 1 probably doesn't match constraints [-Werror]
      asm volatile(
      ^~~
    arch/mips/kernel/cpu-bugs64.c:33:2: error: impossible constraint in 'asm'
      asm volatile(
      ^~~
    arch/mips/kernel/cpu-bugs64.c:33:2: error: impossible constraint in 'asm'
      asm volatile(
      ^~~

Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Masahiro Yamada <[email protected]> Cc: Arnd Bergmann <[email protected]> Cc: Benjamin Herrenschmidt <[email protected]> Cc: Boris Brezillon <[email protected]> Cc: Borislav Petkov <[email protected]> Cc: Brian Norris <[email protected]> Cc: Christophe Leroy <[email protected]> Cc: David Woodhouse <[email protected]> Cc: Heiko Carstens <[email protected]> Cc: "H. Peter Anvin" <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Marek Vasut <[email protected]> Cc: Mark Rutland <[email protected]> Cc: Mathieu Malaterre <[email protected]> Cc: Miquel Raynal <[email protected]> Cc: Paul Mackerras <[email protected]> Cc: Ralf Baechle <[email protected]> Cc: Richard Weinberger <[email protected]> Cc: Russell King <[email protected]> Cc: Stefan Agner <[email protected]> Cc: Thomas Gleixner <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2019-05-14 | arm64: mark (__)cpus_have_const_cap as __always_inline | Masahiro Yamada | 1 | -2/+2
This prepares to move CONFIG_OPTIMIZE_INLINING from x86 to a common place. We need to eliminate potential issues beforehand. If it is enabled for arm64, the following errors are reported:

    In file included from include/linux/compiler_types.h:68,
                     from <command-line>:
    arch/arm64/include/asm/jump_label.h: In function 'cpus_have_const_cap':
    include/linux/compiler-gcc.h:120:38: warning: asm operand 0 probably doesn't match constraints
     #define asm_volatile_goto(x...) do { asm goto(x); asm (""); } while (0)
                                          ^~~
    arch/arm64/include/asm/jump_label.h:32:2: note: in expansion of macro 'asm_volatile_goto'
      asm_volatile_goto(
      ^~~~~~~~~~~~~~~~~
    include/linux/compiler-gcc.h:120:38: error: impossible constraint in 'asm'
     #define asm_volatile_goto(x...) do { asm goto(x); asm (""); } while (0)
                                          ^~~
    arch/arm64/include/asm/jump_label.h:32:2: note: in expansion of macro 'asm_volatile_goto'
      asm_volatile_goto(
      ^~~~~~~~~~~~~~~~~

Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Masahiro Yamada <[email protected]> Tested-by: Mark Rutland <[email protected]> Cc: Arnd Bergmann <[email protected]> Cc: Benjamin Herrenschmidt <[email protected]> Cc: Boris Brezillon <[email protected]> Cc: Borislav Petkov <[email protected]> Cc: Brian Norris <[email protected]> Cc: Christophe Leroy <[email protected]> Cc: David Woodhouse <[email protected]> Cc: Heiko Carstens <[email protected]> Cc: "H. Peter Anvin" <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Marek Vasut <[email protected]> Cc: Mathieu Malaterre <[email protected]> Cc: Miquel Raynal <[email protected]> Cc: Paul Mackerras <[email protected]> Cc: Ralf Baechle <[email protected]> Cc: Richard Weinberger <[email protected]> Cc: Russell King <[email protected]> Cc: Stefan Agner <[email protected]> Cc: Thomas Gleixner <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2019-05-14 | ARM: prevent tracing IPI_CPU_BACKTRACE | Arnd Bergmann | 2 | -1/+6
Patch series "compiler: allow all arches to enable CONFIG_OPTIMIZE_INLINING", v3.

This patch (of 11): When function tracing for IPIs is enabled, we get a warning about an overflow of the ipi_types array with the IPI_CPU_BACKTRACE type, as triggered by raise_nmi():

    arch/arm/kernel/smp.c: In function 'raise_nmi':
    arch/arm/kernel/smp.c:489:2: error: array subscript is above array bounds [-Werror=array-bounds]
      trace_ipi_raise(target, ipi_types[ipinr]);

This is a correct warning, as we actually overflow the array here. This patch changes raise_nmi() to call __smp_cross_call() instead of smp_cross_call(), to avoid calling into ftrace. For clarification, I'm also adding two new code comments describing how this one is special. The warning appears to have shown up after commit e7273ff49acf ("ARM: 8488/1: Make IPI_CPU_BACKTRACE a "non-secure" SGI"), which changed the number assignment from '15' to '8', but as far as I can tell it has existed since the IPI tracepoints were first introduced. If we decide to backport this patch to stable kernels, we probably need to backport e7273ff49acf as well. [[email protected]: rebase on v5.1-rc1] Link: http://lkml.kernel.org/r/[email protected] Fixes: e7273ff49acf ("ARM: 8488/1: Make IPI_CPU_BACKTRACE a "non-secure" SGI") Fixes: 365ec7b17327 ("ARM: add IPI tracepoints") # v3.17 Signed-off-by: Arnd Bergmann <[email protected]> Signed-off-by: Masahiro Yamada <[email protected]> Cc: Heiko Carstens <[email protected]> Cc: Arnd Bergmann <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Christophe Leroy <[email protected]> Cc: Mathieu Malaterre <[email protected]> Cc: "H. Peter Anvin" <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Benjamin Herrenschmidt <[email protected]> Cc: Paul Mackerras <[email protected]> Cc: Ralf Baechle <[email protected]> Cc: Stefan Agner <[email protected]> Cc: Boris Brezillon <[email protected]> Cc: Miquel Raynal <[email protected]> Cc: Richard Weinberger <[email protected]> Cc: David Woodhouse <[email protected]> Cc: Brian Norris <[email protected]> Cc: Marek Vasut <[email protected]> Cc: Russell King <[email protected]> Cc: Borislav Petkov <[email protected]> Cc: Mark Rutland <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2019-05-14 | treewide: remove SPDX "WITH Linux-syscall-note" from kernel-space headers | Masahiro Yamada | 3 | -3/+3
The "WITH Linux-syscall-note" should be added only to headers exported to user-space. Some kernel-space headers have "WITH Linux-syscall-note", which seems to be a mistake.

[1] arch/x86/include/asm/hyperv-tlfs.h
    Commit 5a4858032217 ("x86/hyper-v: move hyperv.h out of uapi") moved this file out of uapi, but missed updating the SPDX License tag.

[2] include/asm-generic/shmparam.h
    Commit 76ce2a80a28e ("Rename include/{uapi => }/asm-generic/shmparam.h really") moved this file out of uapi, but missed updating the SPDX License tag.

[3] include/linux/qcom-geni-se.h
    Commit eddac5af0654 ("soc: qcom: Add GENI based QUP Wrapper driver") added this file, but I do not see a good reason why its license tag must include "WITH Linux-syscall-note".

Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Masahiro Yamada <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2019-05-14 | fs/select: avoid clang stack usage warning | Arnd Bergmann | 1 | -0/+4
The select() implementation is carefully tuned to put a sensible amount of data on the stack for holding a copy of the user space fd_set, but not too large to risk overflowing the kernel stack. When building a 32-bit kernel with clang, we need a little more space than with gcc, which often triggers a warning:

    fs/select.c:619:5: error: stack frame size of 1048 bytes in function 'core_sys_select' [-Werror,-Wframe-larger-than=]
    int core_sys_select(int n, fd_set __user *inp, fd_set __user *outp,

I experimentally found that for 32-bit ARM, reducing the maximum stack usage by 64 bytes keeps us reliably under the warning limit again. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Arnd Bergmann <[email protected]> Reviewed-by: Andi Kleen <[email protected]> Cc: Nick Desaulniers <[email protected]> Cc: Alexander Viro <[email protected]> Cc: Christoph Hellwig <[email protected]> Cc: Eric Dumazet <[email protected]> Cc: "Darrick J. Wong" <[email protected]> Cc: Greg Kroah-Hartman <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
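The fix presumably keys off the compiler; a hedged sketch of the include/linux/poll.h change consistent with the "64 bytes less" description:

    /* Max stack space used in sys_select/sys_poll before falling back to
     * allocating additional memory; clang produces fatter frames, so give
     * it 64 bytes less headroom than gcc. */
    #ifdef __clang__
    #define MAX_STACK_ALLOC 768
    #else
    #define MAX_STACK_ALLOC 832
    #endif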
2019-05-14 | mm/mincore.c: make mincore() more conservative | Jiri Kosina | 1 | -1/+22
The semantics of what mincore() considers to be resident are not completely clear, but Linux has always (since 2.3.52, which is when mincore() was initially done) treated it as "page is available in page cache". That's potentially a problem, as that [in]directly exposes meta-information about pagecache / memory mapping state even about memory not strictly belonging to the process executing the syscall, opening possibilities for sidechannel attacks.

Change the semantics of mincore() so that it only reveals pagecache information for non-anonymous mappings that belong to files that the calling process could (if it tried to) successfully open for writing; otherwise we'd be including shared non-exclusive mappings, which

    - is the sidechannel

    - is not the usecase for mincore(), as that's primarily used for data, not (shared) text

[[email protected]: v2] Link: http://lkml.kernel.org/r/[email protected] [[email protected]: restructure can_do_mincore() conditions] Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Jiri Kosina <[email protected]> Signed-off-by: Vlastimil Babka <[email protected]> Acked-by: Josh Snyder <[email protected]> Acked-by: Michal Hocko <[email protected]> Originally-by: Linus Torvalds <[email protected]> Originally-by: Dominique Martinet <[email protected]> Cc: Andy Lutomirski <[email protected]> Cc: Dave Chinner <[email protected]> Cc: Kevin Easton <[email protected]> Cc: Matthew Wilcox <[email protected]> Cc: Cyril Hrubis <[email protected]> Cc: Tejun Heo <[email protected]> Cc: Kirill A. Shutemov <[email protected]> Cc: Daniel Gruss <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
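The restructured gate ends up roughly like this (a sketch of mm/mincore.c consistent with the description above; treat the exact helper calls as assumptions):

    static inline bool can_do_mincore(struct vm_area_struct *vma)
    {
            if (vma_is_anonymous(vma))
                    return true;
            if (!vma->vm_file)
                    return false;
            /*
             * Reveal pagecache information only for non-anonymous mappings
             * that correspond to files the calling process could open for
             * writing.
             */
            return inode_owner_or_capable(file_inode(vma->vm_file)) ||
                   inode_permission(file_inode(vma->vm_file), MAY_WRITE) == 0;
    }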
2019-05-14 | mm: maintain randomization of page free lists | Dan Williams | 4 | -2/+56
When freeing a page with an order >= shuffle_page_order, randomly select the front or back of the list for insertion. While the mm tries to defragment physical pages into huge pages, this can tend to make the page allocator more predictable over time. Inject the front-back randomness to preserve the initial randomness established by shuffle_free_memory() when the kernel was booted. The overhead of this manipulation is constrained by only being applied for MAX_ORDER sized pages by default. [[email protected]: coding-style fixes] Link: http://lkml.kernel.org/r/154899812788.3165233.9066631950746578517.stgit@dwillia2-desk3.amr.corp.intel.com Signed-off-by: Dan Williams <[email protected]> Reviewed-by: Kees Cook <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Dave Hansen <[email protected]> Cc: Keith Busch <[email protected]> Cc: Robert Elliott <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
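A hedged sketch of the free-path entropy injection, expressed in terms of the list helpers introduced by the next entry below (names and the cached-random-bits trick are assumptions, not the verbatim patch):

    static u64 rand;
    static u8 rand_bits;

    void add_to_free_area_random(struct page *page, struct free_area *area,
                                 int migratetype)
    {
            /* Amortize get_random_u64() by caching 64 coin flips. */
            if (rand_bits == 0) {
                    rand_bits = 64;
                    rand = get_random_u64();
            }
            if (rand & 1)
                    add_to_free_area(page, area, migratetype);
            else
                    add_to_free_area_tail(page, area, migratetype);
            rand >>= 1;
            rand_bits--;
    }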
2019-05-14 | mm: move buddy list manipulations into helpers | Dan Williams | 5 | -48/+73
In preparation for runtime randomization of the zone lists, take all (well, most of) the list_*() functions in the buddy allocator and put them in helper functions. Provide a common control point for injecting additional behavior when freeing pages. [[email protected]: fix buddy list helpers] Link: http://lkml.kernel.org/r/155033679702.1773410.13041474192173212653.stgit@dwillia2-desk3.amr.corp.intel.com [[email protected]: remove del_page_from_free_area() migratetype parameter] Link: http://lkml.kernel.org/r/[email protected] Link: http://lkml.kernel.org/r/154899812264.3165233.5219320056406926223.stgit@dwillia2-desk3.amr.corp.intel.com Signed-off-by: Dan Williams <[email protected]> Signed-off-by: Vlastimil Babka <[email protected]> Tested-by: Tetsuo Handa <[email protected]> Acked-by: Michal Hocko <[email protected]> Cc: Vlastimil Babka <[email protected]> Cc: Dave Hansen <[email protected]> Cc: Kees Cook <[email protected]> Cc: Keith Busch <[email protected]> Cc: Robert Elliott <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
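The helpers are thin wrappers over the lru list plus the nr_free accounting; for instance (a sketch consistent with the entry above, exact signatures assumed):

    static inline void add_to_free_area(struct page *page,
                    struct free_area *area, int migratetype)
    {
            list_add(&page->lru, &area->free_list[migratetype]);
            area->nr_free++;
    }

    static inline void add_to_free_area_tail(struct page *page,
                    struct free_area *area, int migratetype)
    {
            list_add_tail(&page->lru, &area->free_list[migratetype]);
            area->nr_free++;
    }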
2019-05-14 | mm: shuffle initial free memory to improve memory-side-cache utilization | Dan Williams | 9 | -2/+302
Patch series "mm: Randomize free memory", v10. This patch (of 3): Randomization of the page allocator improves the average utilization of a direct-mapped memory-side-cache. Memory side caching is a platform capability that Linux has been previously exposed to in HPC (high-performance computing) environments on specialty platforms. In that instance it was a smaller pool of high-bandwidth-memory relative to higher-capacity / lower-bandwidth DRAM. Now, this capability is going to be found on general purpose server platforms where DRAM is a cache in front of higher latency persistent memory [1]. Robert offered an explanation of the state of the art of Linux interactions with memory-side-caches [2], and I copy it here: It's been a problem in the HPC space: http://www.nersc.gov/research-and-development/knl-cache-mode-performance-coe/ A kernel module called zonesort is available to try to help: https://software.intel.com/en-us/articles/xeon-phi-software and this abandoned patch series proposed that for the kernel: https://lkml.kernel.org/r/[email protected] Dan's patch series doesn't attempt to ensure buffers won't conflict, but also reduces the chance that the buffers will. This will make performance more consistent, albeit slower than "optimal" (which is near impossible to attain in a general-purpose kernel). That's better than forcing users to deploy remedies like: "To eliminate this gradual degradation, we have added a Stream measurement to the Node Health Check that follows each job; nodes are rebooted whenever their measured memory bandwidth falls below 300 GB/s." A replacement for zonesort was merged upstream in commit cc9aec03e58f ("x86/numa_emulation: Introduce uniform split capability"). With this numa_emulation capability, memory can be split into cache sized ("near-memory" sized) numa nodes. A bind operation to such a node, and disabling workloads on other nodes, enables full cache performance. However, once the workload exceeds the cache size then cache conflicts are unavoidable. While HPC environments might be able to tolerate time-scheduling of cache sized workloads, for general purpose server platforms, the oversubscribed cache case will be the common case. The worst case scenario is that a server system owner benchmarks a workload at boot with an un-contended cache only to see that performance degrade over time, even below the average cache performance due to excessive conflicts. Randomization clips the peaks and fills in the valleys of cache utilization to yield steady average performance. Here are some performance impact details of the patches: 1/ An Intel internal synthetic memory bandwidth measurement tool, saw a 3X speedup in a contrived case that tries to force cache conflicts. The contrived cased used the numa_emulation capability to force an instance of the benchmark to be run in two of the near-memory sized numa nodes. If both instances were placed on the same emulated they would fit and cause zero conflicts. While on separate emulated nodes without randomization they underutilized the cache and conflicted unnecessarily due to the in-order allocation per node. 2/ A well known Java server application benchmark was run with a heap size that exceeded cache size by 3X. The cache conflict rate was 8% for the first run and degraded to 21% after page allocator aging. With randomization enabled the rate levelled out at 11%. 3/ A MongoDB workload did not observe measurable difference in cache-conflict rates, but the overall throughput dropped by 7% with randomization in one case. 
4/ Mel Gorman ran his suite of performance workloads with randomization enabled on platforms without a memory-side-cache and saw a mix of some improvements and some losses [3]. While there is potentially significant improvement for applications that depend on low latency access across a wide working-set, the performance may be negligible to negative for other workloads. For this reason the shuffle capability defaults to off unless a direct-mapped memory-side-cache is detected. Even then, the page_alloc.shuffle=0 parameter can be specified to disable the randomization on those systems. Outside of memory-side-cache utilization concerns there is potentially security benefit from randomization. Some data exfiltration and return-oriented-programming attacks rely on the ability to infer the location of sensitive data objects. The kernel page allocator, especially early in system boot, has predictable first-in-first out behavior for physical pages. Pages are freed in physical address order when first onlined. Quoting Kees: "While we already have a base-address randomization (CONFIG_RANDOMIZE_MEMORY), attacks against the same hardware and memory layouts would certainly be using the predictability of allocation ordering (i.e. for attacks where the base address isn't important: only the relative positions between allocated memory). This is common in lots of heap-style attacks. They try to gain control over ordering by spraying allocations, etc. I'd really like to see this because it gives us something similar to CONFIG_SLAB_FREELIST_RANDOM but for the page allocator." While SLAB_FREELIST_RANDOM reduces the predictability of some local slab caches it leaves vast bulk of memory to be predictably in order allocated. However, it should be noted, the concrete security benefits are hard to quantify, and no known CVE is mitigated by this randomization. Introduce shuffle_free_memory(), and its helper shuffle_zone(), to perform a Fisher-Yates shuffle of the page allocator 'free_area' lists when they are initially populated with free memory at boot and at hotplug time. Do this based on either the presence of a page_alloc.shuffle=Y command line parameter, or autodetection of a memory-side-cache (to be added in a follow-on patch). The shuffling is done in terms of CONFIG_SHUFFLE_PAGE_ORDER sized free pages where the default CONFIG_SHUFFLE_PAGE_ORDER is MAX_ORDER-1 i.e. 10, 4MB this trades off randomization granularity for time spent shuffling. MAX_ORDER-1 was chosen to be minimally invasive to the page allocator while still showing memory-side cache behavior improvements, and the expectation that the security implications of finer granularity randomization is mitigated by CONFIG_SLAB_FREELIST_RANDOM. The performance impact of the shuffling appears to be in the noise compared to other memory initialization work. This initial randomization can be undone over time so a follow-on patch is introduced to inject entropy on page free decisions. It is reasonable to ask if the page free entropy is sufficient, but it is not enough due to the in-order initial freeing of pages. At the start of that process putting page1 in front or behind page0 still keeps them close together, page2 is still near page1 and has a high chance of being adjacent. As more pages are added ordering diversity improves, but there is still high page locality for the low address pages and this leads to no significant impact to the cache conflict rate. 
[1]: https://itpeernetwork.intel.com/intel-optane-dc-persistent-memory-operating-modes/
[2]: https://lkml.kernel.org/r/AT5PR8401MB1169D656C8B5E121752FC0F8AB120@AT5PR8401MB1169.NAMPRD84.PROD.OUTLOOK.COM
[3]: https://lkml.org/lkml/2018/10/12/309

[[email protected]: fix shuffle enable] Link: http://lkml.kernel.org/r/154943713038.3858443.4125180191382062871.stgit@dwillia2-desk3.amr.corp.intel.com [[email protected]: fix SHUFFLE_PAGE_ALLOCATOR help texts] Link: http://lkml.kernel.org/r/[email protected] Link: http://lkml.kernel.org/r/154899811738.3165233.12325692939590944259.stgit@dwillia2-desk3.amr.corp.intel.com Signed-off-by: Dan Williams <[email protected]> Signed-off-by: Qian Cai <[email protected]> Reviewed-by: Kees Cook <[email protected]> Acked-by: Michal Hocko <[email protected]> Cc: Dave Hansen <[email protected]> Cc: Keith Busch <[email protected]> Cc: Robert Elliott <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2019-05-14 | mm/vmalloc.c: convert vmap_lazy_nr to atomic_long_t | Uladzislau Rezki (Sony) | 1 | -10/+10
The vmap_lazy_nr variable has type atomic_t, which is a 4-byte integer on both 32-bit and 64-bit systems. lazy_max_pages() deals with "unsigned long", which is 8 bytes on 64-bit systems, so vmap_lazy_nr should be 8 bytes on 64-bit as well. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Uladzislau Rezki (Sony) <[email protected]> Reviewed-by: Andrew Morton <[email protected]> Reviewed-by: William Kucharski <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Matthew Wilcox <[email protected]> Cc: Thomas Garnier <[email protected]> Cc: Oleksiy Avramchenko <[email protected]> Cc: Joel Fernandes <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
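The conversion itself is mechanical, e.g.:

    -static atomic_t vmap_lazy_nr = ATOMIC_INIT(0);
    +static atomic_long_t vmap_lazy_nr = ATOMIC_LONG_INIT(0);

with the remaining touched lines switching the matching accessors to atomic_long_read()/atomic_long_add() and friends (a sketch inferred from the description, not the verbatim diff).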
2019-05-14 | mm/vmalloc.c: add priority threshold to __purge_vmap_area_lazy() | Uladzislau Rezki (Sony) | 1 | -6/+12
Commit 763b218ddfaf ("mm: add preempt points into __purge_vmap_area_lazy()") introduced some preempt points, one of which makes an allocation more prioritized over lazy freeing of vmap areas. Prioritizing an allocation over freeing does not work well all the time; it should rather be a compromise.

1) The number of lazy pages directly influences the busy list length, and thus operations like allocation, lookup, unmap, remove, etc.

2) Under heavy stress of the vmalloc subsystem, I ran into a situation where memory usage kept increasing until hitting the out_of_memory -> panic state, because the logic that frees vmap areas in the __purge_vmap_area_lazy() function was completely blocked.

Establish a threshold past which freeing is prioritized back over allocation, creating a balance between the two.

Using the vmalloc test driver in "stress mode", i.e. when all available test cases are run simultaneously on all online CPUs, applying pressure on the vmalloc subsystem, my HiKey 960 board runs out of memory because the __purge_vmap_area_lazy() logic simply cannot free pages in time. How I run it:

1) Build your kernel with CONFIG_TEST_VMALLOC=m
2) ./tools/testing/selftests/vm/test_vmalloc.sh stress

During this test, "vmap_lazy_nr" pages go far beyond the acceptable lazy_max_pages() threshold, which leads to an enormous busy list size and other problems, including allocation time and so on. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Uladzislau Rezki (Sony) <[email protected]> Reviewed-by: Andrew Morton <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Matthew Wilcox <[email protected]> Cc: Thomas Garnier <[email protected]> Cc: Oleksiy Avramchenko <[email protected]> Cc: Steven Rostedt <[email protected]> Cc: Joel Fernandes <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Tejun Heo <[email protected]> Cc: Joel Fernandes <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
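A hedged sketch of the rebalancing inside __purge_vmap_area_lazy() (variable and lock names are assumptions based on the mm code of this era):

    unsigned long resched_threshold = lazy_max_pages() << 1;

    spin_lock(&vmap_area_lock);
    llist_for_each_entry_safe(va, n_va, valist, purge_list) {
            __free_vmap_area(va);
            atomic_long_sub(nr, &vmap_lazy_nr);
            /* Yield to allocators only while the lazy backlog is modest;
             * past the threshold, freeing takes priority again. */
            if (atomic_long_read(&vmap_lazy_nr) < resched_threshold)
                    cond_resched_lock(&vmap_area_lock);
    }
    spin_unlock(&vmap_area_lock);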
2019-05-14 | kernel/sched/psi.c: expose pressure metrics on root cgroup | Dan Schatzberg | 3 | -7/+14
Pressure metrics are already recorded and exposed in procfs for the entire system, but any tool which monitors cgroup pressure has to special case the root cgroup to read from procfs. This patch exposes the already recorded pressure metrics on the root cgroup. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Dan Schatzberg <[email protected]> Acked-by: Johannes Weiner <[email protected]> Cc: Tejun Heo <[email protected]> Cc: Li Zefan <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Peter Zijlstra <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2019-05-14 | psi: introduce psi monitor | Suren Baghdasaryan | 5 | -20/+742
The psi monitor aims to provide a low-latency, short-term pressure detection mechanism configurable by users. It allows users to monitor psi metric growth and trigger events whenever a metric rises above a user-defined threshold within a user-defined time window. Time window and threshold are both expressed in usecs. Multiple psi resources with different thresholds and window sizes can be monitored concurrently. Psi monitors activate when the system enters a stall state for the monitored psi metric and deactivate upon exit from the stall state. While the system is in the stall state, psi signal growth is monitored at a rate of 10 times per tracking window. The minimum window size is 500ms, therefore the minimum monitoring interval is 50ms. The maximum window size is 10s, with a monitoring interval of 1s. When activated, a psi monitor stays active for at least the duration of one tracking window to avoid repeated activations/deactivations when the psi signal is bouncing. Notifications to users are rate-limited to one per tracking window. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Suren Baghdasaryan <[email protected]> Signed-off-by: Johannes Weiner <[email protected]> Cc: Dennis Zhou <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Jens Axboe <[email protected]> Cc: Li Zefan <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Tejun Heo <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2019-05-14 | include/: refactor headers to allow kthread.h inclusion in psi_types.h | Suren Baghdasaryan | 4 | -2/+4
kthread.h can't be included in psi_types.h because that creates a circular inclusion: kthread.h would eventually include psi_types.h itself and then complain about kthread structures not being defined, because they are defined further down in kthread.h. Resolve this by removing the psi_types.h inclusion from the headers included by kthread.h. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Suren Baghdasaryan <[email protected]> Acked-by: Johannes Weiner <[email protected]> Cc: Dennis Zhou <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Jens Axboe <[email protected]> Cc: Li Zefan <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Tejun Heo <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2019-05-14 | psi: track changed states | Suren Baghdasaryan | 1 | -6/+18
Introduce a changed_states parameter into collect_percpu_times() to track the states that have changed since the last update. This will be needed to detect whether polled states have activated, in the monitor patch. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Suren Baghdasaryan <[email protected]> Acked-by: Johannes Weiner <[email protected]> Cc: Dennis Zhou <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Jens Axboe <[email protected]> Cc: Li Zefan <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Tejun Heo <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2019-05-14 | psi: split update_stats into parts | Suren Baghdasaryan | 1 | -23/+34
Split update_stats() into collect_percpu_times() and update_averages(), so that collect_percpu_times() can be reused later inside the psi monitor. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Suren Baghdasaryan <[email protected]> Acked-by: Johannes Weiner <[email protected]> Cc: Dennis Zhou <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Jens Axboe <[email protected]> Cc: Li Zefan <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Tejun Heo <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2019-05-14 | psi: rename psi fields in preparation for psi trigger addition | Suren Baghdasaryan | 2 | -27/+28
Rename psi_group structure member fields used for calculating psi totals and averages for clear distinction between them and for trigger-related fields that will be added by "psi: introduce psi monitor". [[email protected]: v6] Link: http://lkml.kernel.org/r/[email protected] Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Suren Baghdasaryan <[email protected]> Acked-by: Johannes Weiner <[email protected]> Cc: Dennis Zhou <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Jens Axboe <[email protected]> Cc: Li Zefan <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Tejun Heo <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2019-05-14 | psi: make psi_enable static | Suren Baghdasaryan | 1 | -2/+2
psi_enable is not used outside of psi.c, make it static. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Suren Baghdasaryan <[email protected]> Suggested-by: Andrew Morton <[email protected]> Acked-by: Johannes Weiner <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2019-05-14 | psi: introduce state_mask to represent stalled psi states | Suren Baghdasaryan | 2 | -13/+25
Patch series "psi: pressure stall monitors", v6. This is a respin of: https://lwn.net/ml/linux-kernel/20190308184311.144521-1-surenb%40google.com/ Android is adopting psi to detect and remedy memory pressure that results in stuttering and decreased responsiveness on mobile devices. Psi gives us the stall information, but because we're dealing with latencies in the millisecond range, periodically reading the pressure files to detect stalls in a timely fashion is not feasible. Psi also doesn't aggregate its averages at a high-enough frequency right now. This patch series extends the psi interface such that users can configure sensitive latency thresholds and use poll() and friends to be notified when these are breached. As high-frequency aggregation is costly, it implements an aggregation method that is optimized for fast, short-interval averaging, and makes the aggregation frequency adaptive, such that high-frequency updates only happen while monitored stall events are actively occurring. With these patches applied, Android can monitor for, and ward off, mounting memory shortages before they cause problems for the user. For example, using memory stall monitors in userspace low memory killer daemon (lmkd) we can detect mounting pressure and kill less important processes before device becomes visibly sluggish. In our memory stress testing psi memory monitors produce roughly 10x less false positives compared to vmpressure signals. Having ability to specify multiple triggers for the same psi metric allows other parts of Android framework to monitor memory state of the device and act accordingly. The new interface is straight-forward. The user opens one of the pressure files for writing and writes a trigger description into the file descriptor that defines the stall state - some or full, and the maximum stall time over a given window of time. E.g.: /* Signal when stall time exceeds 100ms of a 1s window */ char trigger[] = "full 100000 1000000" fd = open("/proc/pressure/memory") write(fd, trigger, sizeof(trigger)) while (poll() >= 0) { ... }; close(fd); When the monitored stall state is entered, psi adapts its aggregation frequency according to what the configured time window requires in order to emit event signals in a timely fashion. Once the stalling subsides, aggregation reverts back to normal. The trigger is associated with the open file descriptor. To stop monitoring, the user only needs to close the file descriptor and the trigger is discarded. Patches 1-6 prepare the psi code for polling support. Patch 7 implements the adaptive polling logic, the pressure growth detection optimized for short intervals, and hooks up write() and poll() on the pressure files. The patches were developed in collaboration with Johannes Weiner. This patch (of 7): The psi monitoring patches will need to determine the same states as record_times(). To avoid calculating them twice, maintain a state mask that can be consulted cheaply. Do this in a separate patch to keep the churn in the main feature patch at a minimum. This adds 4-byte state_mask member into psi_group_cpu struct which results in its first cacheline-aligned part becoming 52 bytes long. Add explicit values to enumeration element counters that affect psi_group_cpu struct size. 
Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Suren Baghdasaryan <[email protected]> Acked-by: Johannes Weiner <[email protected]> Cc: Dennis Zhou <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Jens Axboe <[email protected]> Cc: Li Zefan <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Tejun Heo <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2019-05-14 | mm: update references to page _refcount | Baruch Siach | 2 | -2/+2
Commit 0139aa7b7fa ("mm: rename _count, field of the struct page, to _refcount") left out a couple of references to the old field name. Fix that. Link: http://lkml.kernel.org/r/cedf87b02eb8a6b3eac57e8e91da53fb15c3c44c.1556537475.git.baruch@tkos.co.il Fixes: 0139aa7b7fa ("mm: rename _count, field of the struct page, to _refcount") Signed-off-by: Baruch Siach <[email protected]> Reviewed-by: Andrew Morton <[email protected]> Cc: Joonsoo Kim <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2019-05-14 | mm: change mm_update_next_owner() to update mm->owner with WRITE_ONCE | Andrea Arcangeli | 1 | -3/+3
The RCU reader uses rcu_dereference() inside rcu_read_lock critical sections, so the writer shall use WRITE_ONCE. Just a cleanup, we still rely on gcc to emit atomic writes in other places. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Andrea Arcangeli <[email protected]> Reviewed-by: Andrew Morton <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Jann Horn <[email protected]> Cc: Jason Gunthorpe <[email protected]> Cc: "Kirill A . Shutemov" <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Mike Kravetz <[email protected]> Cc: Mike Rapoport <[email protected]> Cc: Oleg Nesterov <[email protected]> Cc: Peter Xu <[email protected]> Cc: zhong jiang <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
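Each plain store becomes the annotated form, pairing with the reader's rcu_dereference(); e.g. (an illustrative one-liner, not the full diff):

    -	mm->owner = c;
    +	WRITE_ONCE(mm->owner, c);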
2019-05-14 | userfaultfd: use RCU to free the task struct when fork fails | Andrea Arcangeli | 1 | -2/+29
The task structure is freed while get_mem_cgroup_from_mm() holds rcu_read_lock() and dereferences mm->owner.

    get_mem_cgroup_from_mm()            failing fork()
    ----                                ---
    task = mm->owner
                                        mm->owner = NULL;
                                        free(task)
    if (task) *task; /* use after free */

The fix consists in freeing the task with RCU also in the fork failure case, exactly like it always happens for the regular exit(2) path. That is enough to make the rcu_read_lock hold in get_mem_cgroup_from_mm() (left side above) effective to avoid a use after free when dereferencing the task structure. An alternate possible fix would be to defer the delivery of the userfaultfd contexts to the monitor until after fork() is guaranteed to succeed. Such a change would require more changes because it would create a strict ordering dependency where the uffd methods would need to be called beyond the last potentially failing branch in order to be safe. This solution as opposed only adds the dependency to common code to set mm->owner to NULL and to free the task struct that was pointed by mm->owner with RCU, if fork ends up failing. The userfaultfd methods can still be called anywhere during the fork runtime and the monitor will keep discarding orphaned "mm" coming from failed forks in userland. This race condition couldn't trigger if CONFIG_MEMCG was set =n at build time. [[email protected]: improve changelog, reduce #ifdefs per Michal] Link: http://lkml.kernel.org/r/[email protected] Link: http://lkml.kernel.org/r/[email protected] Fixes: 893e26e61d04 ("userfaultfd: non-cooperative: Add fork() event") Signed-off-by: Andrea Arcangeli <[email protected]> Tested-by: zhong jiang <[email protected]> Reported-by: [email protected] Cc: Oleg Nesterov <[email protected]> Cc: Jann Horn <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Mike Rapoport <[email protected]> Cc: Mike Kravetz <[email protected]> Cc: Peter Xu <[email protected]> Cc: Jason Gunthorpe <[email protected]> Cc: "Kirill A . Shutemov" <[email protected]> Cc: Michal Hocko <[email protected]> Cc: zhong jiang <[email protected]> Cc: [email protected] Cc: <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
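The shape of the fix in the fork failure path (a hedged sketch of kernel/fork.c; the helper names are my recollection of the patch, treat them as assumptions):

    static void __delayed_free_task(struct rcu_head *rhp)
    {
            struct task_struct *tsk = container_of(rhp, struct task_struct, rcu);

            free_task(tsk);
    }

    static __always_inline void delayed_free_task(struct task_struct *tsk)
    {
            /* Only memcg dereferences mm->owner under RCU; without it the
             * immediate free remains safe. */
            if (IS_ENABLED(CONFIG_MEMCG))
                    call_rcu(&tsk->rcu, __delayed_free_task);
            else
                    free_task(tsk);
    }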
2019-05-14 | kernel/Makefile: don't assume that kernel/gen_ikh_data.sh is executable | Andrew Morton | 1 | -1/+1
If the user downloads and applies patch-5.1.gz using patch(1), the x bit on kernel/gen_ikh_data.sh is not set:

    /bin/sh: 1: ./kernel/gen_ikh_data.sh: Permission denied

Fix this by using CONFIG_SHELL. Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
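In the Makefile rule, that presumably means invoking the script through the configured shell rather than executing it directly; a hedged guess at the shape of the change (the rule's arguments are elided, not known from this log):

    -	$(srctree)/kernel/gen_ikh_data.sh ...
    +	$(CONFIG_SHELL) $(srctree)/kernel/gen_ikh_data.sh ...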
2019-05-14 | io_uring: fix failure to verify SQ_AFF cpu | Jens Axboe | 1 | -2/+3
The test case we have is rightfully failing with the current kernel:

    io_uring_setup(1, 0x7ffe2cafebe0), flags: IORING_SETUP_SQPOLL|IORING_SETUP_SQ_AFF, resv: 0x00000000 0x00000000 0x00000000 0x00000000 0x00000000, sq_thread_cpu: 4
    expected -1, got 3

This is in a vm, and CPU3 is the last valid one, hence asking for 4 should fail the setup with -EINVAL, not succeed. The problem is that we're using array_index_nospec() with nr_cpu_ids as the index, hence we wrap and end up using CPU0 instead of CPU4. This makes the setup succeed where it should be failing. We don't need to use array_index_nospec() as we're not indexing any array with this. Instead just compare with nr_cpu_ids directly. This is fine as we're checking with cpu_online() afterwards. Signed-off-by: Jens Axboe <[email protected]>
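A sketch of the corrected validation, based on the description above (not the verbatim diff):

    if (p->flags & IORING_SETUP_SQ_AFF) {
            int cpu = p->sq_thread_cpu;

            ret = -EINVAL;
            /* No array is indexed with this value, so array_index_nospec()
             * is unnecessary; reject out-of-range CPUs instead of letting
             * the index wrap around to CPU0. */
            if (cpu >= nr_cpu_ids)
                    goto err;
            if (!cpu_online(cpu))
                    goto err;
    }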
2019-05-15 | powerpc/mm: Fix crashes with hugepages & 4K pages | Michael Ellerman | 1 | -1/+1
The recent commit to cleanup ifdefs in the hugepage initialisation led to crashes when using 4K pages, as reported by Sachin:

    BUG: Kernel NULL pointer dereference at 0x0000001c
    Faulting instruction address: 0xc000000001d1e58c
    Oops: Kernel access of bad area, sig: 11 [#1]
    LE PAGE_SIZE=4K MMU=Hash SMP NR_CPUS=2048 NUMA pSeries
    ...
    CPU: 3 PID: 4635 Comm: futex_wake04 Tainted: G W O 5.1.0-next-20190507-autotest #1
    NIP: c000000001d1e58c LR: c000000001d1e54c CTR: 0000000000000000
    REGS: c000000004937890 TRAP: 0300
    MSR: 8000000000009033 <SF,EE,ME,IR,DR,RI,LE> CR: 22424822 XER: 00000000
    CFAR: c00000000183e9e0 DAR: 000000000000001c DSISR: 40000000 IRQMASK: 0
    ...
    NIP kmem_cache_alloc+0xbc/0x5a0
    LR kmem_cache_alloc+0x7c/0x5a0
    Call Trace:
      huge_pte_alloc+0x580/0x950
      hugetlb_fault+0x9a0/0x1250
      handle_mm_fault+0x490/0x4a0
      __do_page_fault+0x77c/0x1f00
      do_page_fault+0x28/0x50
      handle_page_fault+0x18/0x38

This is caused by us trying to allocate from a NULL kmem cache in __hugepte_alloc(). The kmem cache is NULL because it was never allocated in hugetlbpage_init(), because add_huge_page_size() returned an error. The reason add_huge_page_size() returned an error is a simple typo: we are calling check_and_get_huge_psize(size) when we should be passing shift instead. The fact that we're able to trigger this path when the kmem caches are NULL is a separate bug, ie. we should not advertise any hugepage sizes if we haven't set up the required caches for them. This was only seen with 4K pages; with 64K pages we don't need to allocate any extra kmem caches because the 16M hugepage just occupies a single entry at the PMD level. Fixes: 723f268f19da ("powerpc/mm: cleanup ifdef mess in add_huge_page_size()") Reported-by: Sachin Sant <[email protected]> Tested-by: Sachin Sant <[email protected]> Signed-off-by: Michael Ellerman <[email protected]> Reviewed-by: Christophe Leroy <[email protected]> Reviewed-by: Aneesh Kumar K.V <[email protected]>
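The typo fix itself, in add_huge_page_size(), is a one-liner (sketched from the description above):

    -	mmu_psize = check_and_get_huge_psize(size);
    +	mmu_psize = check_and_get_huge_psize(shift);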
2019-05-14selftests: avoid KBUILD_OUTPUT dir cluttering with selftest objectsShuah Khan1-1/+4
Running "make kselftest" or building selftests when KBUILD_OUTPUT is set, will create selftest objects in the KBUILD_OUTPUT directory. This could be undesirable especially when user didn't intend to relocate selftest objects. Use KBUILD_OUTPUT/kselftest to create selftest objects instead of cluttering the main directory. Signed-off-by: Shuah Khan <[email protected]>
2019-05-14selftests: drivers: Create .gitignore to include /dma-buf/udmabufKelsey Skunberg1-0/+1
Create ../selftests/drivers/.gitignore, which holds the following file name created after compiling:

  /dma-buf/udmabuf

Signed-off-by: Kelsey Skunberg <[email protected]> Signed-off-by: Shuah Khan <[email protected]>
2019-05-14selftests: pidfd: Create .gitignore to include pidfd_testKelsey Skunberg1-0/+1
Create ../selftests/pidfd/.gitignore, which holds the following file name created after compiling:

  pidfd_test

Signed-off-by: Kelsey Skunberg <[email protected]> Signed-off-by: Shuah Khan <[email protected]>
2019-05-14nfp: flower: add rcu locks when accessing netdev for tunnelsPieter Jansen van Vuuren1-6/+11
Add RCU locks around netdev accesses when processing route request and tunnel keep-alive messages received from hardware. Fixes: 8e6a9046b66a ("nfp: flower vxlan neighbour offload") Fixes: 856f5b135758 ("nfp: flower vxlan neighbour keep-alive") Signed-off-by: Pieter Jansen van Vuuren <[email protected]> Reviewed-by: Jakub Kicinski <[email protected]> Reviewed-by: John Hurley <[email protected]> Signed-off-by: David S. Miller <[email protected]>
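A minimal sketch of the locking pattern described above (illustrative only; process_tunnel_msg() is a hypothetical stand-in for the nfp flower handlers): the netdev lookup and every use of the result stay under one rcu_read_lock() section:

  rcu_read_lock();
  netdev = dev_get_by_index_rcu(net, ifindex);
  if (netdev)
          /* netdev may only be dereferenced while the lock is held */
          process_tunnel_msg(netdev, msg);
  rcu_read_unlock();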
2019-05-14ppp: deflate: Fix possible crash in deflate_initYueHaibing1-6/+14
BUG: unable to handle kernel paging request at ffffffffa018f000
  PGD 3270067 P4D 3270067 PUD 3271063 PMD 2307eb067 PTE 0
  Oops: 0000 [#1] PREEMPT SMP
  CPU: 0 PID: 4138 Comm: modprobe Not tainted 5.1.0-rc7+ #1
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.9.3-0-ge2fc41e-prebuilt.qemu-project.org 04/01/2014
  RIP: 0010:ppp_register_compressor+0x3e/0xd0 [ppp_generic]
  Code: 98 4a 3f e2 48 8b 15 c1 67 00 00 41 8b 0c 24 48 81 fa 40 f0 19 a0 75 0e eb 35 48 8b 12 48 81 fa 40 f0 19 a0 74
  RSP: 0018:ffffc90000d93c68 EFLAGS: 00010287
  RAX: ffffffffa018f000 RBX: ffffffffa01a3000 RCX: 000000000000001a
  RDX: ffff888230c750a0 RSI: 0000000000000000 RDI: ffffffffa019f000
  RBP: ffffc90000d93c80 R08: 0000000000000001 R09: 0000000000000000
  R10: 0000000000000000 R11: 0000000000000000 R12: ffffffffa0194080
  R13: ffff88822ee1a700 R14: 0000000000000000 R15: ffffc90000d93e78
  FS:  00007f2339557540(0000) GS:ffff888237a00000(0000) knlGS:0000000000000000
  CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  CR2: ffffffffa018f000 CR3: 000000022bde4000 CR4: 00000000000006f0
  Call Trace:
   ? 0xffffffffa01a3000
   deflate_init+0x11/0x1000 [ppp_deflate]
   ? 0xffffffffa01a3000
   do_one_initcall+0x6c/0x3cc
   ? kmem_cache_alloc_trace+0x248/0x3b0
   do_init_module+0x5b/0x1f1
   load_module+0x1db1/0x2690
   ? m_show+0x1d0/0x1d0
   __do_sys_finit_module+0xc5/0xd0
   __x64_sys_finit_module+0x15/0x20
   do_syscall_64+0x6b/0x1d0
   entry_SYSCALL_64_after_hwframe+0x49/0xbe

If ppp_deflate fails to register in deflate_init, module initialization fails out; however, ppp_deflate_draft may have been registered and is not unregistered before returning. A second modprobe will then trigger a crash like this.

Reported-by: Hulk Robot <[email protected]> Signed-off-by: YueHaibing <[email protected]> Acked-by: Guillaume Nault <[email protected]> Signed-off-by: David S. Miller <[email protected]>
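A minimal sketch of the error handling described above (an assumed shape of deflate_init; the actual fix in ppp_deflate.c may differ in detail): if the second registration fails, unregister the first before propagating the error, so a later modprobe starts from a clean state:

  static int __init deflate_init(void)
  {
          int rc;

          rc = ppp_register_compressor(&ppp_deflate);
          if (rc)
                  return rc;

          rc = ppp_register_compressor(&ppp_deflate_draft);
          if (rc) {
                  /* roll back the first registration on failure */
                  ppp_unregister_compressor(&ppp_deflate);
                  return rc;
          }

          return 0;
  }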
2019-05-14net: macb: fix error format in dev_err()Luca Ceresoli1-8/+8
Errors are negative numbers. Using %u shows them as very large positive numbers such as 4294967277, which don't make sense. Use the %d format instead and get a much nicer -19. Signed-off-by: Luca Ceresoli <[email protected]> Fixes: b48e0bab142f ("net: macb: Migrate to devm clock interface") Fixes: 93b31f48b3ba ("net/macb: unify clock management") Fixes: 421d9df0628b ("net/macb: merge at91_ether driver into macb driver") Fixes: aead88bd0e99 ("net: ethernet: macb: Add support for rx_clk") Fixes: f5473d1d44e4 ("net: macb: Support clock management for tsu_clk") Acked-by: Nicolas Ferre <[email protected]> Reviewed-by: Andrew Lunn <[email protected]> Signed-off-by: David S. Miller <[email protected]>
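For example (illustrative only; the actual clock names and messages in the macb driver differ):

  err = clk_prepare_enable(pclk);
  if (err) {
          /* %d prints -19 for -ENODEV; %u would print 4294967277 */
          dev_err(&pdev->dev, "failed to enable pclk (%d)\n", err);
          return err;
  }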
2019-05-14rtnetlink: always put IFLA_LINK for links with a link-netnsidSabrina Dubroca1-6/+10
Currently, nla_put_iflink() doesn't put the IFLA_LINK attribute when iflink == ifindex. In some cases, a device can be created in a different netns with the same ifindex as its parent. That device will not dump its IFLA_LINK attribute, which can confuse some userspace software that expects it. For example, if the last ifindex created in init_net and foo are both 8, these commands will trigger the issue:

  ip link add parent type dummy                    # ifindex 9
  ip link add link parent netns foo type macvlan   # ifindex 9 in ns foo

So, in case a device puts the IFLA_LINK_NETNSID attribute in a dump, always put the IFLA_LINK attribute as well. Thanks to Dan Winship for analyzing the original OpenShift bug down to the missing netlink attribute.

v2: change Fixes tag, it's been here forever, as Nicolas Dichtel said
    add Nicolas' ack
v3: change Fixes tag
    fix subject typo, spotted by Edward Cree

Analyzed-by: Dan Winship <[email protected]> Fixes: d8a5ec672768 ("[NET]: netlink support for moving devices between network namespaces.") Signed-off-by: Sabrina Dubroca <[email protected]> Acked-by: Nicolas Dichtel <[email protected]> Signed-off-by: David S. Miller <[email protected]>
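A sketch of the change described above (simplified from the commit's description; the parameter name is an assumption): nla_put_iflink() grows a flag that forces the attribute even when iflink == ifindex, set whenever the dump also carries IFLA_LINK_NETNSID:

  static int nla_put_iflink(struct sk_buff *skb,
                            const struct net_device *dev, bool force)
  {
          int iflink = dev_get_iflink(dev);

          /* put IFLA_LINK when it differs from ifindex, or
           * unconditionally when a link-netnsid is being dumped */
          if (force || dev->ifindex != iflink)
                  return nla_put_u32(skb, IFLA_LINK, iflink);

          return 0;
  }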
2019-05-14net/mlx4_core: Change the error print to info printYunjian Wang1-1/+1
The error print within mlx4_flow_steer_promisc_add() should be an info print. Fixes: 592e49dda812 ('net/mlx4: Implement promiscuous mode with device managed flow-steering') Signed-off-by: Yunjian Wang <[email protected]> Reviewed-by: Tariq Toukan <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2019-05-14NFC: Orphan the subsystemJohannes Berg1-4/+2
Samuel clearly hasn't been working on this in many years, and patches getting to the wireless list are now being ignored entirely. Mark the subsystem as orphan to reflect the current state, and revert to the netdev list so at least some fixes can be picked up by Dave. Signed-off-by: Johannes Berg <[email protected]> Acked-by: Samuel Ortiz <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2019-05-14net: Always descend into dsa/Florian Fainelli1-1/+1
Jiri reported that with a kernel built with CONFIG_FIXED_PHY=y, CONFIG_NET_DSA=m and CONFIG_NET_DSA_LOOP=m, we would not get to a functional state where the mock-up driver is registered. It turns out that we are not descending into drivers/net/dsa/ unconditionally, so we won't be able to link in dsa_loop_bdinfo.o, which does the actual mock-up mdio device registration. Reported-by: Jiri Pirko <[email protected]> Fixes: 40013ff20b1b ("net: dsa: Fix functional dsa-loop dependency on FIXED_PHY") Signed-off-by: Florian Fainelli <[email protected]> Reviewed-by: Vivien Didelot <[email protected]> Tested-by: Jiri Pirko <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2019-05-14tcp: fix retrans timestamp on passive Fast OpenYuchung Cheng1-0/+3
Commit c7d13c8faa74 ("tcp: properly track retry time on passive Fast Open") sets the start of the SYNACK retransmission time on passive Fast Open in "retrans_stamp". However, the timestamp is not reset once the handshake has completed. As a result, future data packet retransmissions may not update it in tcp_retransmit_skb(). This may lead to the socket aborting earlier than expected in retransmits_timed_out(), since retrans_stamp still holds the SYNACK rtx time.

This bug only manifests on a passive TFO sender that a) suffered a SYNACK timeout and then b) stalls on the very first loss recovery. Any successful loss recovery would reset the timestamp and avoid this issue.

Fixes: c7d13c8faa74 ("tcp: properly track retry time on passive Fast Open") Signed-off-by: Yuchung Cheng <[email protected]> Signed-off-by: Neal Cardwell <[email protected]> Signed-off-by: Eric Dumazet <[email protected]> Signed-off-by: David S. Miller <[email protected]>
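A minimal sketch of the fix described above (placement per the commit text, assuming it lands where the socket leaves TCP_SYN_RECV; surrounding code omitted):

  /* the handshake is complete: retrans_stamp still holds the SYNACK
   * retransmission time, so clear it and let any future data-loss
   * recovery set its own start time for retransmits_timed_out() */
  tp->retrans_stamp = 0;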
2019-05-15drm/pl111: Initialize clock spinlock earlyGuenter Roeck1-2/+3
The following warning is seen on systems with a broken clock divider.

  INFO: trying to register non-static key.
  the code is fine but needs lockdep annotation.
  turning off the locking correctness validator.
  CPU: 0 PID: 1 Comm: swapper Not tainted 5.1.0-09698-g1fb3b52 #1
  Hardware name: ARM Integrator/CP (Device Tree)
  [<c0011be8>] (unwind_backtrace) from [<c000ebb8>] (show_stack+0x10/0x18)
  [<c000ebb8>] (show_stack) from [<c07d3fd0>] (dump_stack+0x18/0x24)
  [<c07d3fd0>] (dump_stack) from [<c0060d48>] (register_lock_class+0x674/0x6f8)
  [<c0060d48>] (register_lock_class) from [<c005de2c>] (__lock_acquire+0x68/0x2128)
  [<c005de2c>] (__lock_acquire) from [<c0060408>] (lock_acquire+0x110/0x21c)
  [<c0060408>] (lock_acquire) from [<c07f755c>] (_raw_spin_lock+0x34/0x48)
  [<c07f755c>] (_raw_spin_lock) from [<c0536c8c>] (pl111_display_enable+0xf8/0x5fc)
  [<c0536c8c>] (pl111_display_enable) from [<c0502f54>] (drm_atomic_helper_commit_modeset_enables+0x1ec/0x244)

Since commit eedd6033b4c8 ("drm/pl111: Support variants with broken clock divider"), the spinlock is not initialized if the clock divider is broken. Initialize it earlier to fix the problem.

Fixes: eedd6033b4c8 ("drm/pl111: Support variants with broken clock divider") Cc: Linus Walleij <[email protected]> Signed-off-by: Guenter Roeck <[email protected]> Signed-off-by: Linus Walleij <[email protected]> Link: https://patchwork.freedesktop.org/patch/msgid/[email protected]
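A minimal sketch of the fix described above (field and function names are illustrative, not necessarily those of the pl111 driver): take the spin_lock_init() out of the conditional path so the lock is valid on every variant:

  /* initialize unconditionally: pl111_display_enable() takes the
   * lock even on variants with a broken clock divider */
  spin_lock_init(&priv->tim2_lock);

  if (!priv->variant->broken_clockdivider) {
          /* only the clock divider setup stays conditional */
          ret = pl111_init_clock_divider(drm);
          if (ret)
                  return ret;
  }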
2019-05-14cifs:smbd Use the correct DMA direction when sending dataLong Li1-3/+5
When sending data, use DMA_TO_DEVICE to map buffers. Also log the number of requests in a compound request from the upper layer. Signed-off-by: Long Li <[email protected]> Signed-off-by: Steve French <[email protected]> Acked-by: Pavel Shilovsky <[email protected]> Acked-by: Ronnie Sahlberg <[email protected]>
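A minimal sketch of the mapping direction (illustrative of an RDMA send path; variable names are assumptions, not the smbd code):

  /* data flows host -> device on a send, so map with DMA_TO_DEVICE */
  sge->addr = ib_dma_map_single(device, buf, len, DMA_TO_DEVICE);
  if (ib_dma_mapping_error(device, sge->addr))
          return -EIO;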
2019-05-14cifs:smbd When reconnecting to server, call smbd_destroy() after all MIDs ↵Long Li1-17/+20
have been called

Commit 214bab448476 ("cifs: Call MID callback before destroying transport") assumes that the MID callback should not take srv_mutex, but this may not always be true. SMB Direct requires the MID callbacks to be completed before the transport is destroyed, so that all pending memory registrations can be freed. So restore the original calling sequence so the TCP transport will use the same code, but move smbd_destroy() to after all the MID callbacks have been called.

Fixes: 214bab448476 ("cifs: Call MID callback before destroying transport") Signed-off-by: Long Li <[email protected]> Signed-off-by: Steve French <[email protected]> Reviewed-by: Pavel Shilovsky <[email protected]>
2019-05-14power: supply: olpc_battery: force the le/be castsLubomir Rintel1-3/+3
The endianness of data returned from the EC depends on the particular EC version, determined at run time. Cast from little/big endian explicitly in the routine that flips endianness to the native one, to make sparse happy. Signed-off-by: Lubomir Rintel <[email protected]> Reported-by: kbuild test robot <[email protected]> Fixes: 76311b9a3295 ("power: supply: olpc_battery: Add OLPC XO 1.75 support") Signed-off-by: Sebastian Reichel <[email protected]>
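A minimal sketch of the kind of cast described above (illustrative; the real helper in olpc_battery.c differs, and the variable names here are assumptions): the raw EC word is force-cast to the endian type matching the detected EC version before converting to native byte order:

  /* the wire endianness of ec_word depends on the EC version, so
   * tell sparse explicitly which representation we convert from */
  if (little_endian)
          *val = le16_to_cpu((__force __le16)ec_word);
  else
          *val = be16_to_cpu((__force __be16)ec_word);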