blaster4385/linux-IllusionX - Linux kernel with personal config changes for arch linux

Age	Commit message (Collapse)	Author	Files	Lines
2017-11-18	kbuild: /bin/pwd -> pwd	Bjørn Forsman	4	-4/+4
	Most places use pwd and rely on $PATH lookup. Moving the remaining absolute path /bin/pwd users over for consistency. Also, a reason for doing /bin/pwd -> pwd instead of the other way around is because I believe build systems should make little assumptions on host filesystem layout. Case in point, we do this kind of patching already in NixOS. Ref. commit 028568d84da3cfca49f5f846eeeef01441d70451 ("kbuild: revert $(realpath ...) to $(shell cd ... && /bin/pwd)"). Signed-off-by: Bjørn Forsman <[email protected]> Signed-off-by: Masahiro Yamada <[email protected]>
2017-11-17	Merge tag 'kbuild-misc-v4.15' of ↵	Linus Torvalds	12	-219/+457
	git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild Pull Kbuild misc updates from Masahiro Yamada: - Clean up and fix RPM package build - Fix a warning in DEB package build - Improve coccicheck script - Improve some semantic patches * tag 'kbuild-misc-v4.15' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild: docs: dev-tools: coccinelle: delete out of date wiki reference coccinelle: orplus: reorganize to improve performance coccinelle: use exists to improve efficiency builddeb: Pass the kernel:debarch substvar to dpkg-genchanges Coccinelle: use false positive annotation coccinelle: fix verbose message about .cocci file being run coccinelle: grep Options and Requires fields more precisely Coccinelle: make DEBUG_FILE option more useful coccinelle: api: detect identical chip data arrays coccinelle: Improve setup_timer.cocci matching Coccinelle: setup_timer: improve messages from setup_timer kbuild: rpm-pkg: do not force -jN in submake kbuild: rpm-pkg: keep spec file until make mrproper kbuild: rpm-pkg: fix jobserver unavailable warning kbuild: rpm-pkg: replace $RPM_BUILD_ROOT with %{buildroot} kbuild: rpm-pkg: fix build error when CONFIG_MODULES is disabled kbuild: rpm-pkg: refactor mkspec with here doc kbuild: rpm-pkg: clean up mkspec kbuild: rpm-pkg: install vmlinux.bz2 unconditionally kbuild: rpm-pkg: remove ppc64 specific image handling
2017-11-17	Merge tag 'kbuild-v4.15' of ↵	Linus Torvalds	21	-258/+296
	git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild Pull Kbuild updates from Masahiro Yamada: "One of the most remarkable improvements in this cycle is, Kbuild is now able to cache the result of shell commands. Some variables are expensive to compute, for example, $(call cc-option,...) invokes the compiler. It is not efficient to redo this computation every time, even when we are not actually building anything. Kbuild creates a hidden file ".cache.mk" that contains invoked shell commands and their results. The speed-up should be noticeable. Summary: - Fix arch build issues (hexagon, sh) - Clean up various Makefiles and scripts - Fix wrong usage of {CFLAGS,LDFLAGS}_MODULE in arch Makefiles - Cache variables that are expensive to compute - Improve cc-ldopton and ld-option for Clang - Optimize output directory creation" * tag 'kbuild-v4.15' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild: (30 commits) kbuild: move coccicheck help from scripts/Makefile.help to top Makefile sh: decompressor: add shipped files to .gitignore frv: .gitignore: ignore vmlinux.lds selinux: remove unnecessary assignment to subdir- kbuild: specify FORCE in Makefile.headersinst as .PHONY target kbuild: remove redundant mkdir from ./Kbuild kbuild: optimize object directory creation for incremental build kbuild: create object directories simpler and faster kbuild: filter-out PHONY targets from "targets" kbuild: remove redundant $(wildcard ...) for cmd_files calculation kbuild: create directory for make cache only when necessary sh: select KBUILD_DEFCONFIG depending on ARCH kbuild: fix linker feature test macros when cross compiling with Clang kbuild: shrink .cache.mk when it exceeds 1000 lines kbuild: do not call cc-option before KBUILD_CFLAGS initialization kbuild: Cache a few more calls to the compiler kbuild: Add a cache for generated variables kbuild: add forward declaration of default target to Makefile.asm-generic kbuild: remove KBUILD_SUBDIR_ASFLAGS and KBUILD_SUBDIR_CCFLAGS hexagon/kbuild: replace CFLAGS_MODULE with KBUILD_CFLAGS_MODULE ...
2017-11-17	Merge branch 'akpm' (patches from Andrew)	Linus Torvalds	115	-952/+1892
	Merge more updates from Andrew Morton: - a bit more MM - procfs updates - dynamic-debug fixes - lib/ updates - checkpatch - epoll - nilfs2 - signals - rapidio - PID management cleanup and optimization - kcov updates - sysvipc updates - quite a few misc things all over the place * emailed patches from Andrew Morton <[email protected]>: (94 commits) EXPERT Kconfig menu: fix broken EXPERT menu include/asm-generic/topology.h: remove unused parent_node() macro arch/tile/include/asm/topology.h: remove unused parent_node() macro arch/sparc/include/asm/topology_64.h: remove unused parent_node() macro arch/sh/include/asm/topology.h: remove unused parent_node() macro arch/ia64/include/asm/topology.h: remove unused parent_node() macro drivers/pcmcia/sa1111_badge4.c: avoid unused function warning mm: add infrastructure for get_user_pages_fast() benchmarking sysvipc: make get_maxid O(1) again sysvipc: properly name ipc_addid() limit parameter sysvipc: duplicate lock comments wrt ipc_addid() sysvipc: unteach ids->next_id for !CHECKPOINT_RESTORE initramfs: use time64_t timestamps drivers/watchdog: make use of devm_register_reboot_notifier() kernel/reboot.c: add devm_register_reboot_notifier() kcov: update documentation Makefile: support flag -fsanitizer-coverage=trace-cmp kcov: support comparison operands collection kcov: remove pointless current != NULL check kernel/panic.c: add TAINT_AUX ...
2017-11-17	EXPERT Kconfig menu: fix broken EXPERT menu	Randy Dunlap	1	-91/+93
	Clean up the EXPERT menu (yet again). Move FHANDLE and CHECKPOINT_RESTORE into the primary EXPERT menu since they already depend on EXPERT. Move BPF_SYSCALL and USERFAULTFD out of the EXPERT Kconfig symbols menu list since they do not depend on EXPERT and were breaking the continuity of that menu list. Move all of the KALLSYMS Kconfig symbols to the end of the EXPERT menu. This separates the kernel services from the build options. This patch depends on [PATCH] pci: move PCI_QUIRKS to the PCI bus menu (https://lkml.org/lkml/2017/11/2/907). Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Randy Dunlap <[email protected]> Acked-by: Daniel Borkmann <[email protected]> [BPF] Cc: Andrea Arcangeli <[email protected]> Cc: Alexei Starovoitov <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2017-11-17	include/asm-generic/topology.h: remove unused parent_node() macro	Dou Liyang	1	-3/+0
	Commit a7be6e5a7f8d ("mm: drop useless local parameters of __register_one_node()") removed the last user of parent_node(). The parent_node() macro in generic situation is unnecessary. Remove it for cleanup. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Dou Liyang <[email protected]> Reported-by: Michael Ellerman <[email protected]> Cc: Arnd Bergmann <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2017-11-17	arch/tile/include/asm/topology.h: remove unused parent_node() macro	Dou Liyang	1	-6/+0
	Commit a7be6e5a7f8d ("mm: drop useless local parameters of __register_one_node()") removed the last user of parent_node(). The parent_node() macro in tile platform is unnecessary. Remove it for cleanup. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Dou Liyang <[email protected]> Reported-by: Michael Ellerman <[email protected]> Acked-by: Chris Metcalf <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2017-11-17	arch/sparc/include/asm/topology_64.h: remove unused parent_node() macro	Dou Liyang	1	-2/+0
	Commit a7be6e5a7f8d ("mm: drop useless local parameters of __register_one_node()") removed the last user of parent_node(). The parent_node() macro in SPARC64 platform is unnecessary. Remove it for cleanup. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Dou Liyang <[email protected]> Reported-by: Michael Ellerman <[email protected]> Acked-by: David S. Miller <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2017-11-17	arch/sh/include/asm/topology.h: remove unused parent_node() macro	Dou Liyang	1	-1/+0
	Commit a7be6e5a7f8d ("mm: drop useless local parameters of __register_one_node()") removed the last user of parent_node(). The parent_node() macro in SUPERH platform is unnecessary. Remove it for cleanup. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Dou Liyang <[email protected]> Reported-by: Michael Ellerman <[email protected]> Cc: Yoshinori Sato <[email protected]> Cc: Rich Felker <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2017-11-17	arch/ia64/include/asm/topology.h: remove unused parent_node() macro	Dou Liyang	1	-7/+0
	Commit a7be6e5a7f8d ("mm: drop useless local parameters of __register_one_node()") removed the last user of parent_node(). The parent_node() macro in IA64(Itanium) platform is unnecessary. Remove it for cleanup. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Dou Liyang <[email protected]> Reported-by: Michael Ellerman <[email protected]> Cc: Tony Luck <[email protected]> Cc: Fenghua Yu <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2017-11-17	drivers/pcmcia/sa1111_badge4.c: avoid unused function warning	Arnd Bergmann	1	-0/+2
	pcmv_setup() is only used when the badge4 driver is built-in, but not when it is a loadable module: drivers/pcmcia/sa1111_badge4.c:153:122: error: 'pcmv_setup' defined but not used [-Werror=unused-function] This adds an #ifdef to avoid the definition of the unused function in the modular case. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Arnd Bergmann <[email protected]> Cc: Russell King <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2017-11-17	mm: add infrastructure for get_user_pages_fast() benchmarking	Kirill A. Shutemov	5	-0/+202
	Performance of get_user_pages_fast() is critical for some workloads, but it's tricky to test it directly. This patch provides /sys/kernel/debug/gup_benchmark that helps with testing performance of it. See tools/testing/selftests/vm/gup_benchmark.c for userspace counterpart. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Kirill A. Shutemov <[email protected]> Cc: Shuah Khan <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Thorsten Leemhuis <[email protected]> Cc: Jonathan Corbet <[email protected]> Cc: Huang Ying <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2017-11-17	sysvipc: make get_maxid O(1) again	Davidlohr Bueso	3	-33/+32
	For a custom microbenchmark on a 3.30GHz Xeon SandyBridge, which calls IPC_STAT over and over, it was calculated that, on avg the cost of ipc_get_maxid() for increasing amounts of keys was: 10 keys: ~900 cycles 100 keys: ~15000 cycles 1000 keys: ~150000 cycles 10000 keys: ~2100000 cycles This is unsurprising as maxid is currently O(n). By having the max_id available in O(1) we save all those cycles for each semctl(_STAT) command, the idr_find can be expensive -- which some real (customer) workloads actually poll on. Note that this used to be the case, until commit 7ca7e564e04 ("ipc: store ipcs into IDRs"). The cost is the extra idr_find when doing RMIDs, but we simply go backwards, and should not take too many iterations to find the new value. [[email protected]: coding-style fixes] Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Davidlohr Bueso <[email protected]> Cc: Manfred Spraul <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2017-11-17	sysvipc: properly name ipc_addid() limit parameter	Davidlohr Bueso	1	-5/+5
	This is better understood as a limit, instead of size; exactly like the function comment indicates. Rename it. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Davidlohr Bueso <[email protected]> Cc: Manfred Spraul <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2017-11-17	sysvipc: duplicate lock comments wrt ipc_addid()	Davidlohr Bueso	2	-0/+2
	The comment in msgqueues when using ipc_addid() is quite useful imo. Duplicate it for shm and semaphores. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Davidlohr Bueso <[email protected]> Cc: Manfred Spraul <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2017-11-17	sysvipc: unteach ids->next_id for !CHECKPOINT_RESTORE	Davidlohr Bueso	3	-20/+47
	Patch series "sysvipc: ipc-key management improvements". Here are a few improvements I spotted while eyeballing Guillaume's rhashtable implementation for ipc keys. The first and fourth patches are the interesting ones, the middle two are trivial. This patch (of 4): The next_id object-allocation functionality was introduced in commit 03f595668017 ("ipc: add sysctl to specify desired next object id"). Given that these new entries are _only_ exported under the CONFIG_CHECKPOINT_RESTORE option, there is no point for the common case to even know about ->next_id. As such rewrite ipc_buildid() such that it can do away with the field as well as unnecessary branches when adding a new identifier. The end result also better differentiates both cases, so the code ends up being cleaner; albeit the small duplications regarding the default case. [[email protected]: coding-style fixes] Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Davidlohr Bueso <[email protected]> Cc: Manfred Spraul <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2017-11-17	initramfs: use time64_t timestamps	Arnd Bergmann	1	-5/+5
	The cpio format uses a 32-bit number to encode file timestamps, which breaks initramfs support in 2038. This reinterprets the timestamp as unsigned, to give us another 68 years and avoids breaking until 2106. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Arnd Bergmann <[email protected]> Cc: Al Viro <[email protected]> Cc: Deepa Dinamani <[email protected]> Cc: Arnd Bergmann <[email protected]> Cc: Daniel Thompson <[email protected]> Cc: Lokesh Vutla <[email protected]> Cc: Stafford Horne <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2017-11-17	drivers/watchdog: make use of devm_register_reboot_notifier()	Andrey Smirnov	2	-35/+32
	Save a bit of cleanup code by leveraging newly added devm_register_reboot_notifier(). [[email protected]: small cleanup: avoid 80-col tricks] Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Andrey Smirnov <[email protected]> Acked-by: Guenter Roeck <[email protected]> Cc: Chris Healy <[email protected]> Cc: Wim Van Sebroeck <[email protected]> Cc: Andy Shevchenko <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2017-11-17	kernel/reboot.c: add devm_register_reboot_notifier()	Andrey Smirnov	2	-0/+31
	Add devm_* wrapper around register_reboot_notifier to simplify device specific reboot notifier registration/unregistration. [[email protected]: move `struct device' forward decl to top-of-file] Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Andrey Smirnov <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2017-11-17	kcov: update documentation	Victor Chibotaru	1	-4/+95
	The updated documentation describes new KCOV mode for collecting comparison operands. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Victor Chibotaru <[email protected]> Signed-off-by: Alexander Potapenko <[email protected]> Cc: Dmitry Vyukov <[email protected]> Cc: Andrey Konovalov <[email protected]> Cc: Mark Rutland <[email protected]> Cc: Alexander Popov <[email protected]> Cc: Andrey Ryabinin <[email protected]> Cc: Kees Cook <[email protected]> Cc: Vegard Nossum <[email protected]> Cc: Quentin Casasnovas <[email protected]> Cc: <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2017-11-17	Makefile: support flag -fsanitizer-coverage=trace-cmp	Victor Chibotaru	3	-2/+18
	The flag enables Clang instrumentation of comparison operations (currently not supported by GCC). This instrumentation is needed by the new KCOV device to collect comparison operands. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Victor Chibotaru <[email protected]> Signed-off-by: Alexander Potapenko <[email protected]> Cc: Dmitry Vyukov <[email protected]> Cc: Andrey Konovalov <[email protected]> Cc: Mark Rutland <[email protected]> Cc: Alexander Popov <[email protected]> Cc: Andrey Ryabinin <[email protected]> Cc: Kees Cook <[email protected]> Cc: Vegard Nossum <[email protected]> Cc: Quentin Casasnovas <[email protected]> Cc: <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2017-11-17	kcov: support comparison operands collection	Victor Chibotaru	3	-39/+211
	Enables kcov to collect comparison operands from instrumented code. This is done by using Clang's -fsanitize=trace-cmp instrumentation (currently not available for GCC). The comparison operands help a lot in fuzz testing. E.g. they are used in Syzkaller to cover the interiors of conditional statements with way less attempts and thus make previously unreachable code reachable. To allow separate collection of coverage and comparison operands two different work modes are implemented. Mode selection is now done via a KCOV_ENABLE ioctl call with corresponding argument value. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Victor Chibotaru <[email protected]> Signed-off-by: Alexander Potapenko <[email protected]> Cc: Dmitry Vyukov <[email protected]> Cc: Andrey Konovalov <[email protected]> Cc: Mark Rutland <[email protected]> Cc: Alexander Popov <[email protected]> Cc: Andrey Ryabinin <[email protected]> Cc: Kees Cook <[email protected]> Cc: Vegard Nossum <[email protected]> Cc: Quentin Casasnovas <[email protected]> Cc: <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2017-11-17	kcov: remove pointless current != NULL check	Andrey Ryabinin	1	-1/+1
	__sanitizer_cov_trace_pc() is a hot code, so it's worth to remove pointless '!current' check. Current is never NULL. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Andrey Ryabinin <[email protected]> Acked-by: Dmitry Vyukov <[email protected]> Acked-by: Mark Rutland <[email protected]> Cc: Andrey Konovalov <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2017-11-17	kernel/panic.c: add TAINT_AUX	Borislav Petkov	2	-1/+4
	This is the gist of a patch which we've been forward-porting in our kernels for a long time now and it probably would make a good sense to have such TAINT_AUX flag upstream which can be used by each distro etc, how they see fit. This way, we won't need to forward-port a distro-only version indefinitely. Add an auxiliary taint flag to be used by distros and others. This obviates the need to forward-port whatever internal solutions people have in favor of a single flag which they can map arbitrarily to a definition of their pleasing. The "X" mnemonic could also mean eXternal, which would be taint from a distro or something else but not the upstream kernel. We will use it to mark modules for which we don't provide support. I.e., a really eXternal module. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Borislav Petkov <[email protected]> Cc: Kees Cook <[email protected]> Cc: Jessica Yu <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Jiri Slaby <[email protected]> Cc: Jiri Olsa <[email protected]> Cc: Michal Marek <[email protected]> Cc: Jiri Kosina <[email protected]> Cc: Takashi Iwai <[email protected]> Cc: Petr Mladek <[email protected]> Cc: Jeff Mahoney <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2017-11-17	pid: remove pidhash	Gargi Sharma	8	-50/+18
	pidhash is no longer required as all the information can be looked up from idr tree. nr_hashed represented the number of pids that had been hashed. Since, nr_hashed and PIDNS_HASH_ADDING are no longer relevant, it has been renamed to pid_allocated and PIDNS_ADDING respectively. [[email protected]: v6] Link: http://lkml.kernel.org/r/[email protected] Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Gargi Sharma <[email protected]> Reviewed-by: Rik van Riel <[email protected]> Tested-by: Tony Luck <[email protected]> [ia64] Cc: Julia Lawall <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Pavel Tatashin <[email protected]> Cc: Kirill Tkhai <[email protected]> Cc: Oleg Nesterov <[email protected]> Cc: Eric W. Biederman <[email protected]> Cc: Christoph Hellwig <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2017-11-17	pid: replace pid bitmap implementation with IDR API	Gargi Sharma	6	-209/+65
	Patch series "Replacing PID bitmap implementation with IDR API", v4. This series replaces kernel bitmap implementation of PID allocation with IDR API. These patches are written to simplify the kernel by replacing custom code with calls to generic code. The following are the stats for pid and pid_namespace object files before and after the replacement. There is a noteworthy change between the IDR and bitmap implementation. Before text data bss dec hex filename 8447 3894 64 12405 3075 kernel/pid.o After text data bss dec hex filename 3397 304 0 3701 e75 kernel/pid.o Before text data bss dec hex filename 5692 1842 192 7726 1e2e kernel/pid_namespace.o After text data bss dec hex filename 2854 216 16 3086 c0e kernel/pid_namespace.o The following are the stats for ps, pstree and calling readdir on /proc for 10,000 processes. ps: With IDR API With bitmap real 0m1.479s 0m2.319s user 0m0.070s 0m0.060s sys 0m0.289s 0m0.516s pstree: With IDR API With bitmap real 0m1.024s 0m1.794s user 0m0.348s 0m0.612s sys 0m0.184s 0m0.264s proc: With IDR API With bitmap real 0m0.059s 0m0.074s user 0m0.000s 0m0.004s sys 0m0.016s 0m0.016s This patch (of 2): Replace the current bitmap implementation for Process ID allocation. Functions that are no longer required, for example, free_pidmap(), alloc_pidmap(), etc. are removed. The rest of the functions are modified to use the IDR API. The change was made to make the PID allocation less complex by replacing custom code with calls to generic API. [[email protected]: v6] Link: http://lkml.kernel.org/r/[email protected] [[email protected]: restore the old behaviour of the ns_last_pid sysctl] Link: http://lkml.kernel.org/r/[email protected] Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Gargi Sharma <[email protected]> Reviewed-by: Rik van Riel <[email protected]> Acked-by: Oleg Nesterov <[email protected]> Cc: Julia Lawall <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Pavel Tatashin <[email protected]> Cc: Kirill Tkhai <[email protected]> Cc: Eric W. Biederman <[email protected]> Cc: Christoph Hellwig <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2017-11-17	kernel/sysctl.c: code cleanups	Ola N. Kaldestad	1	-5/+3
	Remove unnecessary else block, remove redundant return and call to kfree in if block. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ola N. Kaldestad <[email protected]> Acked-by: Kees Cook <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2017-11-17	Documentation/sysctl/vm.txt: fix typo	Kangmin Park	1	-1/+1
	Link: http://lkml.kernel.org/r/CAKW4uUyCi=PnKf3epgFVz8z=1tMtHSOHNm+fdNxrNw3-THvRCA@mail.gmail.com Signed-off-by: Kangmin Park <[email protected]> Cc: Jiri Kosina <[email protected]> Cc: Alan Cox <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2017-11-17	drivers/rapidio/devices/rio_mport_cdev.c: fix error handling in ↵	Christophe JAILLET	1	-1/+1
	'rio_dma_transfer()' In case of error, 'dma_map_sg()' returns 0, not a negative value. There is BUG_ON() in 'dma_map_sg_attrs()' which makes sure of that. Link: http://lkml.kernel.org/r/d4235bd2b9274e99f6c86ea71b1fa1c7bd8d0c08.1505687047.git.christophe.jaillet@wanadoo.fr Signed-off-by: Christophe JAILLET <[email protected]> Reviewed-by: Logan Gunthorpe <[email protected]> Cc: Matt Porter <[email protected]> Cc: Alexandre Bounine <[email protected]> Cc: Lorenzo Stoakes <[email protected]> Cc: Jesper Nilsson <[email protected]> Cc: Christian K_nig <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2017-11-17	drivers/rapidio/devices/rio_mport_cdev.c: fix resource leak in error ↵	Christophe JAILLET	1	-1/+2
	handling path in 'rio_dma_transfer()' If 'dma_map_sg()', we should branch to the existing error handling path to free some resources before returning. Link: http://lkml.kernel.org/r/61292a4f369229eee03394247385e955027283f8.1505687047.git.christophe.jaillet@wanadoo.fr Signed-off-by: Christophe JAILLET <[email protected]> Reviewed-by: Logan Gunthorpe <[email protected]> Cc: Matt Porter <[email protected]> Cc: Alexandre Bounine <[email protected]> Cc: Lorenzo Stoakes <[email protected]> Cc: Jesper Nilsson <[email protected]> Cc: Christian K_nig <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2017-11-17	rapidio: constify rio_device_id	Arvind Yadav	5	-5/+5
	rio_device_id are not supposed to change at runtime. rio driver is working with const 'id_table'. So mark the non-const rio_device_id structs as const. Link: http://lkml.kernel.org/r/[email protected] Link: http://lkml.kernel.org/r/[email protected] Link: http://lkml.kernel.org/r/[email protected] Link: http://lkml.kernel.org/r/[email protected] Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Arvind Yadav <[email protected]> Acked-by: Alexandre Bounine <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2017-11-17	kdump: print a message in case parse_crashkernel_mem resulted in zero bytes	Dave Young	1	-1/+2
	parse_crashkernel_mem() silently returns if we get zero bytes in the parsing function. It is useful for debugging to add a message, especially if the kernel cannot boot correctly. Add a pr_info instead of pr_warn because it is expected behavior for size = 0, eg. crashkernel=2G-4G:128M, size will be 0 in case system memory is less than 2G. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Dave Young <[email protected]> Cc: Baoquan He <[email protected]> Cc: Vivek Goyal <[email protected]> Cc: Bhupesh Sharma <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2017-11-17	kernel/signal.c: remove the no longer needed SIGNAL_UNKILLABLE check in ↵	Oleg Nesterov	1	-2/+2
	complete_signal() complete_signal() checks SIGNAL_UNKILLABLE before it starts to destroy the thread group, today this is wrong in many ways. If nothing else, fatal_signal_pending() should always imply that the whole thread group (except ->group_exit_task if it is not NULL) is killed, this check breaks the rule. After the previous changes we can rely on sig_task_ignored(); sig_fatal(sig) && SIGNAL_UNKILLABLE can only be true if we actually want to kill this task and sig == SIGKILL OR it is traced and debugger can intercept the signal. This should hopefully fix the problem reported by Dmitry. This test-case static int init(void arg) { for (;;) pause(); } int main(void) { char stack[16 1024]; for (;;) { int pid = clone(init, stack + sizeof(stack)/2, CLONE_NEWPID \| SIGCHLD, NULL); assert(pid > 0); assert(ptrace(PTRACE_ATTACH, pid, 0, 0) == 0); assert(waitpid(-1, NULL, WSTOPPED) == pid); assert(ptrace(PTRACE_DETACH, pid, 0, SIGSTOP) == 0); assert(syscall(__NR_tkill, pid, SIGKILL) == 0); assert(pid == wait(NULL)); } } triggers the WARN_ON_ONCE(!(task->jobctl & JOBCTL_STOP_PENDING)) in task_participate_group_stop(). do_signal_stop()->signal_group_exit() checks SIGNAL_GROUP_EXIT and return false, but task_set_jobctl_pending() checks fatal_signal_pending() and does not set JOBCTL_STOP_PENDING. And his should fix the minor security problem reported by Kyle, SECCOMP_RET_TRACE can miss fatal_signal_pending() the same way if the task is the root of a pid namespace. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Oleg Nesterov <[email protected]> Reported-by: Dmitry Vyukov <[email protected]> Reported-by: Kyle Huey <[email protected]> Reviewed-by: Kees Cook <[email protected]> Tested-by: Kyle Huey <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2017-11-17	kernel/signal.c: protect the SIGNAL_UNKILLABLE tasks from !sig_kernel_only() ↵	Oleg Nesterov	1	-1/+1
	signals Change sig_task_ignored() to drop the SIG_DFL && !sig_kernel_only() signals even if force == T. This simplifies the next change and this matches the same check in get_signal() which will drop these signals anyway. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Oleg Nesterov <[email protected]> Tested-by: Kyle Huey <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2017-11-17	kernel/signal.c: protect the traced SIGNAL_UNKILLABLE tasks from SIGKILL	Oleg Nesterov	1	-5/+7
	The comment in sig_ignored() says "Tracers may want to know about even ignored signals" but SIGKILL can not be reported to debugger and it is just wrong to return 0 in this case: SIGKILL should only kill the SIGNAL_UNKILLABLE task if it comes from the parent ns. Change sig_ignored() to ignore ->ptrace if sig == SIGKILL and rely on sig_task_ignored(). SISGTOP coming from within the namespace is not really right too but at least debugger can intercept it, and we can't drop it here because this will break "gdb -p 1": ptrace_attach() won't work. Perhaps we will add another ->ptrace check later, we will see. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Oleg Nesterov <[email protected]> Tested-by: Kyle Huey <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2017-11-17	fat: remove redundant assignment of 0 to slots	Colin Ian King	1	-1/+0
	The variable slots is being assigned a value of zero that is never read, slots is being updated again a few lines later. Remove this redundant assignment. Cleans clang warning: Value stored to 'slots' is never read Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Colin Ian King <[email protected]> Acked-by: OGAWA Hirofumi <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2017-11-17	hfs/hfsplus: clean up unused variables in bnode.c	Christos Gkekas	2	-8/+0
	Delete variables 'tree' and 'sb', which are set but never used. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Christos Gkekas <[email protected]> Cc: Christoph Hellwig <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2017-11-17	nilfs2: remove inode->i_version initialization	Jeff Layton	1	-1/+0
	It's never used in nilfs2. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Jeff Layton <[email protected]> Signed-off-by: Ryusuke Konishi <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2017-11-17	nilfs2: use octal for unreadable permission macro	Ryusuke Konishi	1	-1/+1
	Replace S_IRWXUGO with 0777 because symbolic permissions are considered harmful: https://lwn.net/Articles/696229/ Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ryusuke Konishi <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2017-11-17	nilfs2: align block comments of nilfs_sufile_truncate_range() at *	Ryusuke Konishi	1	-16/+16
	Fix the following checkpatch warning: WARNING: Block comments should align the * on each line #633: FILE: sufile.c:633: +/** + * nilfs_sufile_truncate_range - truncate range of segment array Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ryusuke Konishi <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2017-11-17	fs, nilfs: convert nilfs_root.count from atomic_t to refcount_t	Elena Reshetova	2	-6/+7
	atomic_t variables are currently used to implement reference counters with the following properties: - counter is initialized to 1 using atomic_set() - a resource is freed upon counter reaching zero - once counter reaches zero, its further increments aren't allowed - counter schema uses basic atomic operations (set, inc, inc_not_zero, dec_and_test, etc.) Such atomic variables should be converted to a newly provided refcount_t type and API that prevents accidental counter overflows and underflows. This is important since overflows and underflows can lead to use-after-free situation and be exploitable. The variable nilfs_root.count is used as pure reference counter. Convert it to refcount_t and fix up the operations. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Elena Reshetova <[email protected]> Signed-off-by: Ryusuke Konishi <[email protected]> Suggested-by: Kees Cook <[email protected]> Reviewed-by: David Windsor <[email protected]> Reviewed-by: Hans Liljestrand <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2017-11-17	nilfs2: fix race condition that causes file system corruption	Andreas Rohner	1	-2/+4
	There is a race condition between nilfs_dirty_inode() and nilfs_set_file_dirty(). When a file is opened, nilfs_dirty_inode() is called to update the access timestamp in the inode. It calls __nilfs_mark_inode_dirty() in a separate transaction. __nilfs_mark_inode_dirty() caches the ifile buffer_head in the i_bh field of the inode info structure and marks it as dirty. After some data was written to the file in another transaction, the function nilfs_set_file_dirty() is called, which adds the inode to the ns_dirty_files list. Then the segment construction calls nilfs_segctor_collect_dirty_files(), which goes through the ns_dirty_files list and checks the i_bh field. If there is a cached buffer_head in i_bh it is not marked as dirty again. Since nilfs_dirty_inode() and nilfs_set_file_dirty() use separate transactions, it is possible that a segment construction that writes out the ifile occurs in-between the two. If this happens the inode is not on the ns_dirty_files list, but its ifile block is still marked as dirty and written out. In the next segment construction, the data for the file is written out and nilfs_bmap_propagate() updates the b-tree. Eventually the bmap root is written into the i_bh block, which is not dirty, because it was written out in another segment construction. As a result the bmap update can be lost, which leads to file system corruption. Either the virtual block address points to an unallocated DAT block, or the DAT entry will be reused for something different. The error can remain undetected for a long time. A typical error message would be one of the "bad btree" errors or a warning that a DAT entry could not be found. This bug can be reproduced reliably by a simple benchmark that creates and overwrites millions of 4k files. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Andreas Rohner <[email protected]> Signed-off-by: Ryusuke Konishi <[email protected]> Tested-by: Andreas Rohner <[email protected]> Tested-by: Ryusuke Konishi <[email protected]> Cc: <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2017-11-17	fs/nilfs2: convert timers to use timer_setup()	Kees Cook	2	-6/+6
	In preparation for unconditionally passing the struct timer_list pointer to all timer callbacks, switch to using the new timer_setup() and from_timer() to pass the timer pointer explicitly. This requires adding a pointer to hold the timer's target task, as the lifetime of sc_task doesn't appear to match the timer's task. Link: http://lkml.kernel.org/r/20171016235900.GA102729@beast Signed-off-by: Kees Cook <[email protected]> Acked-by: Ryusuke Konishi <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2017-11-17	sysctl: check for UINT_MAX before unsigned int min/max	Joe Lawrence	1	-6/+8
	Mikulas noticed in the existing do_proc_douintvec_minmax_conv() and do_proc_dopipe_max_size_conv() introduced in this patchset, that they inconsistently handle overflow and min/max range inputs: For example: 0 ... param->min - 1 ---> ERANGE param->min ... param->max ---> the value is accepted param->max + 1 ... 0x100000000L + param->min - 1 ---> ERANGE 0x100000000L + param->min ... 0x100000000L + param->max ---> EINVAL 0x100000000L + param->max + 1, 0x200000000L + param->min - 1 ---> ERANGE 0x200000000L + param->min ... 0x200000000L + param->max ---> EINVAL 0x200000000L + param->max + 1, 0x300000000L + param->min - 1 ---> ERANGE In do_proc_do*() routines which store values into unsigned int variables (4 bytes wide for 64-bit builds), first validate that the input unsigned long value (8 bytes wide for 64-bit builds) will fit inside the smaller unsigned int variable. Then check that the unsigned int value falls inside the specified parameter min, max range. Otherwise the unsigned long -> unsigned int conversion drops leading bits from the input value, leading to the inconsistent pattern Mikulas documented above. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Joe Lawrence <[email protected]> Reported-by: Mikulas Patocka <[email protected]> Reviewed-by: Mikulas Patocka <[email protected]> Cc: Al Viro <[email protected]> Cc: Jens Axboe <[email protected]> Cc: Michael Kerrisk <[email protected]> Cc: Randy Dunlap <[email protected]> Cc: Josh Poimboeuf <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2017-11-17	pipe: add proc_dopipe_max_size() to safely assign pipe_max_size	Joe Lawrence	4	-15/+56
	pipe_max_size is assigned directly via procfs sysctl: static struct ctl_table fs_table[] = { ... { .procname = "pipe-max-size", .data = &pipe_max_size, .maxlen = sizeof(int), .mode = 0644, .proc_handler = &pipe_proc_fn, .extra1 = &pipe_min_size, }, ... int pipe_proc_fn(struct ctl_table table, int write, void __user buf, size_t lenp, loff_t ppos) { ... ret = proc_dointvec_minmax(table, write, buf, lenp, ppos) ... and then later rounded in-place a few statements later: ... pipe_max_size = round_pipe_size(pipe_max_size); ... This leaves a window of time between initial assignment and rounding that may be visible to other threads. (For example, one thread sets a non-rounded value to pipe_max_size while another reads its value.) Similar reads of pipe_max_size are potentially racy: pipe.c :: alloc_pipe_info() pipe.c :: pipe_set_size() Add a new proc_dopipe_max_size() that consolidates reading the new value from the user buffer, verifying bounds, and calling round_pipe_size() with a single assignment to pipe_max_size. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Joe Lawrence <[email protected]> Reported-by: Mikulas Patocka <[email protected]> Reviewed-by: Mikulas Patocka <[email protected]> Cc: Al Viro <[email protected]> Cc: Jens Axboe <[email protected]> Cc: Michael Kerrisk <[email protected]> Cc: Randy Dunlap <[email protected]> Cc: Josh Poimboeuf <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2017-11-17	pipe: avoid round_pipe_size() nr_pages overflow on 32-bit	Joe Lawrence	1	-2/+15
	round_pipe_size() contains a right-bit-shift expression which may overflow, which would cause undefined results in a subsequent roundup_pow_of_two() call. static inline unsigned int round_pipe_size(unsigned int size) { unsigned long nr_pages; nr_pages = (size + PAGE_SIZE - 1) >> PAGE_SHIFT; return roundup_pow_of_two(nr_pages) << PAGE_SHIFT; } PAGE_SIZE is defined as (1UL << PAGE_SHIFT), so: - 4 bytes wide on 32-bit (0 to 0xffffffff) - 8 bytes wide on 64-bit (0 to 0xffffffffffffffff) That means that 32-bit round_pipe_size(), nr_pages may overflow to 0: size=0x00000000 nr_pages=0x0 size=0x00000001 nr_pages=0x1 size=0xfffff000 nr_pages=0xfffff size=0xfffff001 nr_pages=0x0 << ! size=0xffffffff nr_pages=0x0 << ! This is bad because roundup_pow_of_two(n) is undefined when n == 0! 64-bit is not a problem as the unsigned int size is 4 bytes wide (similar to 32-bit) and the larger, 8 byte wide unsigned long, is sufficient to handle the largest value of the bit shift expression: size=0xffffffff nr_pages=100000 Modify round_pipe_size() to return 0 if n == 0 and updates its callers to handle accordingly. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Joe Lawrence <[email protected]> Reported-by: Mikulas Patocka <[email protected]> Reviewed-by: Mikulas Patocka <[email protected]> Cc: Al Viro <[email protected]> Cc: Jens Axboe <[email protected]> Cc: Michael Kerrisk <[email protected]> Cc: Randy Dunlap <[email protected]> Cc: Josh Poimboeuf <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2017-11-17	pipe: match pipe_max_size data type with procfs	Joe Lawrence	2	-2/+2
	Patch series "A few round_pipe_size() and pipe-max-size fixups", v3. While backporting Michael's "pipe: fix limit handling" patchset to a distro-kernel, Mikulas noticed that current upstream pipe limit handling contains a few problems: 1 - procfs signed wrap: echo'ing a large number into /proc/sys/fs/pipe-max-size and then cat'ing it back out shows a negative value. 2 - round_pipe_size() nr_pages overflow on 32bit: this would subsequently try roundup_pow_of_two(0), which is undefined. 3 - visible non-rounded pipe-max-size value: there is no mutual exclusion or protection between the time pipe_max_size is assigned a raw value from proc_dointvec_minmax() and when it is rounded. 4 - unsigned long -> unsigned int conversion makes for potential odd return errors from do_proc_douintvec_minmax_conv() and do_proc_dopipe_max_size_conv(). This version underwent the same testing as v1: https://marc.info/?l=linux-kernel&m=150643571406022&w=2 This patch (of 4): pipe_max_size is defined as an unsigned int: unsigned int pipe_max_size = 1048576; but its procfs/sysctl representation is an integer: static struct ctl_table fs_table[] = { ... { .procname = "pipe-max-size", .data = &pipe_max_size, .maxlen = sizeof(int), .mode = 0644, .proc_handler = &pipe_proc_fn, .extra1 = &pipe_min_size, }, ... that is signed: int pipe_proc_fn(struct ctl_table table, int write, void __user buf, size_t lenp, loff_t ppos) { ... ret = proc_dointvec_minmax(table, write, buf, lenp, ppos) This leads to signed results via procfs for large values of pipe_max_size: % echo 2147483647 >/proc/sys/fs/pipe-max-size % cat /proc/sys/fs/pipe-max-size -2147483648 Use unsigned operations on this variable to avoid such negative values. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Joe Lawrence <[email protected]> Reported-by: Mikulas Patocka <[email protected]> Reviewed-by: Mikulas Patocka <[email protected]> Cc: Michael Kerrisk <[email protected]> Cc: Randy Dunlap <[email protected]> Cc: Al Viro <[email protected]> Cc: Jens Axboe <[email protected]> Cc: Josh Poimboeuf <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2017-11-17	autofs: don't fail mount for transient error	NeilBrown	1	-1/+14
	Currently if the autofs kernel module gets an error when writing to the pipe which links to the daemon, then it marks the whole moutpoint as catatonic, and it will stop working. It is possible that the error is transient. This can happen if the daemon is slow and more than 16 requests queue up. If a subsequent process tries to queue a request, and is then signalled, the write to the pipe will return -ERESTARTSYS and autofs will take that as total failure. So change the code to assess -ERESTARTSYS and -ENOMEM as transient failures which only abort the current request, not the whole mountpoint. It isn't a crash or a data corruption, but having autofs mountpoints suddenly stop working is rather inconvenient. Ian said: : And given the problems with a half dozen (or so) user space applications : consuming large amounts of CPU under heavy mount and umount activity this : could happen more easily than we expect. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: NeilBrown <[email protected]> Acked-by: Ian Kent <[email protected]> Cc: <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2017-11-17	init/version.c: include <linux/export.h> instead of <linux/module.h>	Masahiro Yamada	1	-1/+1
	init/version.c has nothing to do with modules, so remove the <linux/modude.h>. Instead, include <linux/export.h> for EXPORT_SYMBOL_GPL. This cuts off a lot of unnecessary header parsing. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Masahiro Yamada <[email protected]> Cc: Paul Gortmaker <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2017-11-17	epoll: remove ep_call_nested() from ep_eventpoll_poll()	Jason Baron	1	-45/+35
	The use of ep_call_nested() in ep_eventpoll_poll(), which is the .poll routine for an epoll fd, is used to prevent excessively deep epoll nesting, and to prevent circular paths. However, we are already preventing these conditions during EPOLL_CTL_ADD. In terms of too deep epoll chains, we do in fact allow deep nesting of the epoll fds themselves (deeper than EP_MAX_NESTS), however we don't allow more than EP_MAX_NESTS when an epoll file descriptor is actually connected to a wakeup source. Thus, we do not require the use of ep_call_nested(), since ep_eventpoll_poll(), which is called via ep_scan_ready_list() only continues nesting if there are events available. Since ep_call_nested() is implemented using a global lock, applications that make use of nested epoll can see large performance improvements with this change. Davidlohr said: : Improvements are quite obscene actually, such as for the following : epoll_wait() benchmark with 2 level nesting on a 80 core IvyBridge: : : ncpus vanilla dirty delta : 1 2447092 3028315 +23.75% : 4 231265 2986954 +1191.57% : 8 121631 2898796 +2283.27% : 16 59749 2902056 +4757.07% : 32 26837 2326314 +8568.30% : 64 12926 1341281 +10276.61% : : (http://linux-scalability.org/epoll/epoll-test.c) Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Jason Baron <[email protected]> Cc: Davidlohr Bueso <[email protected]> Cc: Alexander Viro <[email protected]> Cc: Salman Qazi <[email protected]> Cc: Hou Tao <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>