|
Add list_next_rcu() for safely fetching the next list entry under rcu_dereference().
Found with sparse in linux-next tree on tag next-20171116.
Signed-off-by: Tim Hansen <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
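For illustration, a minimal sketch of the access pattern this enables,
assuming the list_next_rcu() helper from <linux/rculist.h>; the wrapper
function below is hypothetical:

	/* Hypothetical helper: fetch the first entry of an RCU-protected
	 * list.  list_next_rcu() exposes ->next as an __rcu pointer, so the
	 * load goes through rcu_dereference() and sparse stays quiet.
	 * Caller must be inside an rcu_read_lock() section.
	 */
	static struct list_head *first_entry_rcu(struct list_head *head)
	{
		return rcu_dereference(list_next_rcu(head));
	}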
|
|
Merge more updates from Andrew Morton:
- a bit more MM
- procfs updates
- dynamic-debug fixes
- lib/ updates
- checkpatch
- epoll
- nilfs2
- signals
- rapidio
- PID management cleanup and optimization
- kcov updates
- sysvipc updates
- quite a few misc things all over the place
* emailed patches from Andrew Morton <[email protected]>: (94 commits)
EXPERT Kconfig menu: fix broken EXPERT menu
include/asm-generic/topology.h: remove unused parent_node() macro
arch/tile/include/asm/topology.h: remove unused parent_node() macro
arch/sparc/include/asm/topology_64.h: remove unused parent_node() macro
arch/sh/include/asm/topology.h: remove unused parent_node() macro
arch/ia64/include/asm/topology.h: remove unused parent_node() macro
drivers/pcmcia/sa1111_badge4.c: avoid unused function warning
mm: add infrastructure for get_user_pages_fast() benchmarking
sysvipc: make get_maxid O(1) again
sysvipc: properly name ipc_addid() limit parameter
sysvipc: duplicate lock comments wrt ipc_addid()
sysvipc: unteach ids->next_id for !CHECKPOINT_RESTORE
initramfs: use time64_t timestamps
drivers/watchdog: make use of devm_register_reboot_notifier()
kernel/reboot.c: add devm_register_reboot_notifier()
kcov: update documentation
Makefile: support flag -fsanitize-coverage=trace-cmp
kcov: support comparison operands collection
kcov: remove pointless current != NULL check
kernel/panic.c: add TAINT_AUX
...
|
|
Clean up the EXPERT menu (yet again).
Move FHANDLE and CHECKPOINT_RESTORE into the primary EXPERT menu since
they already depend on EXPERT.
Move BPF_SYSCALL and USERFAULTFD out of the EXPERT Kconfig symbols menu
list since they do not depend on EXPERT and were breaking the continuity
of that menu list.
Move all of the KALLSYMS Kconfig symbols to the end of the EXPERT menu.
This separates the kernel services from the build options.
This patch depends on [PATCH] pci: move PCI_QUIRKS to the PCI bus menu
(https://lkml.org/lkml/2017/11/2/907).
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Randy Dunlap <[email protected]>
Acked-by: Daniel Borkmann <[email protected]> [BPF]
Cc: Andrea Arcangeli <[email protected]>
Cc: Alexei Starovoitov <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Commit a7be6e5a7f8d ("mm: drop useless local parameters of
__register_one_node()") removed the last user of parent_node().
The parent_node() macro in the generic header is unnecessary.
Remove it for cleanup.
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Dou Liyang <[email protected]>
Reported-by: Michael Ellerman <[email protected]>
Cc: Arnd Bergmann <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Commit a7be6e5a7f8d ("mm: drop useless local parameters of
__register_one_node()") removed the last user of parent_node().
The parent_node() macro on the tile platform is unnecessary.
Remove it for cleanup.
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Dou Liyang <[email protected]>
Reported-by: Michael Ellerman <[email protected]>
Acked-by: Chris Metcalf <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Commit a7be6e5a7f8d ("mm: drop useless local parameters of
__register_one_node()") removed the last user of parent_node().
The parent_node() macro on the SPARC64 platform is unnecessary.
Remove it for cleanup.
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Dou Liyang <[email protected]>
Reported-by: Michael Ellerman <[email protected]>
Acked-by: David S. Miller <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Commit a7be6e5a7f8d ("mm: drop useless local parameters of
__register_one_node()") removed the last user of parent_node().
The parent_node() macro on the SUPERH platform is unnecessary.
Remove it for cleanup.
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Dou Liyang <[email protected]>
Reported-by: Michael Ellerman <[email protected]>
Cc: Yoshinori Sato <[email protected]>
Cc: Rich Felker <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Commit a7be6e5a7f8d ("mm: drop useless local parameters of
__register_one_node()") removed the last user of parent_node().
The parent_node() macro on the IA64 (Itanium) platform is unnecessary.
Remove it for cleanup.
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Dou Liyang <[email protected]>
Reported-by: Michael Ellerman <[email protected]>
Cc: Tony Luck <[email protected]>
Cc: Fenghua Yu <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
pcmv_setup() is only used when the badge4 driver is built-in, but not
when it is a loadable module:
drivers/pcmcia/sa1111_badge4.c:153:122: error: 'pcmv_setup' defined but not used [-Werror=unused-function]
This adds an #ifdef to avoid the definition of the unused function in
the modular case.
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Arnd Bergmann <[email protected]>
Cc: Russell King <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Performance of get_user_pages_fast() is critical for some workloads, but
it's tricky to test it directly.
This patch provides /sys/kernel/debug/gup_benchmark that helps with
testing performance of it.
See tools/testing/selftests/vm/gup_benchmark.c for userspace
counterpart.
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Kirill A. Shutemov <[email protected]>
Cc: Shuah Khan <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Thorsten Leemhuis <[email protected]>
Cc: Jonathan Corbet <[email protected]>
Cc: Huang Ying <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
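For orientation, a hedged userspace sketch of how a client of this
interface might look; the struct layout and ioctl number below are
assumptions for illustration only, the authoritative definitions live in
the selftest sources referenced above:

	#include <fcntl.h>
	#include <stdint.h>
	#include <stdio.h>
	#include <sys/ioctl.h>
	#include <sys/mman.h>
	#include <unistd.h>

	struct gup_benchmark {		/* assumed layout, see the selftest */
		uint64_t get_delta_usec;
		uint64_t addr;
		uint64_t size;
		uint32_t nr_pages_per_call;
		uint32_t flags;
	};

	#define GUP_FAST_BENCHMARK _IOWR('g', 1, struct gup_benchmark) /* assumed */

	int main(void)
	{
		struct gup_benchmark gup = { 0 };
		size_t size = 128 << 20;	/* 128 MiB test area */
		void *p;
		int fd;

		fd = open("/sys/kernel/debug/gup_benchmark", O_RDWR);
		if (fd < 0)
			return 1;

		p = mmap(NULL, size, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
		if (p == MAP_FAILED)
			return 1;

		gup.addr = (uint64_t)(unsigned long)p;
		gup.size = size;
		gup.nr_pages_per_call = 64;

		if (ioctl(fd, GUP_FAST_BENCHMARK, &gup) == 0)
			printf("get_user_pages_fast: %llu usec\n",
			       (unsigned long long)gup.get_delta_usec);

		close(fd);
		return 0;
	}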
|
|
For a custom microbenchmark on a 3.30GHz SandyBridge Xeon that calls
IPC_STAT over and over, the average cost of ipc_get_maxid() for
increasing numbers of keys was measured as:
10 keys: ~900 cycles
100 keys: ~15000 cycles
1000 keys: ~150000 cycles
10000 keys: ~2100000 cycles
This is unsurprising, as maxid is currently O(n).
By having the max_id available in O(1) we save all those cycles on each
semctl(_STAT) command; the idr_find can be expensive, and some real
(customer) workloads actually poll on this command.
Note that this used to be the case, until commit 7ca7e564e04 ("ipc:
store ipcs into IDRs"). The cost is an extra idr_find when doing
RMIDs, but we simply walk backwards, and it should not take too many
iterations to find the new value.
[[email protected]: coding-style fixes]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Davidlohr Bueso <[email protected]>
Cc: Manfred Spraul <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
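To make the "go backwards" idea concrete, a hedged sketch of the
RMID-side update (field and function names are illustrative, not the
actual ipc/util.c code):

	/* Hedged sketch: keep a cached maximum index valid across removals. */
	static void ipc_update_max_idx(struct ipc_ids *ids, int removed_idx)
	{
		if (removed_idx != ids->max_idx)
			return;			/* cached maximum still valid */

		/* walk backwards until the next live id is found */
		while (--removed_idx >= 0) {
			if (idr_find(&ids->ipcs_idr, removed_idx))
				break;
		}
		ids->max_idx = removed_idx;	/* -1 when the set is empty */
	}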
|
|
This is better understood as a limit, instead of size; exactly like the
function comment indicates. Rename it.
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Davidlohr Bueso <[email protected]>
Cc: Manfred Spraul <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
The comment in msgqueues when using ipc_addid() is quite useful imo.
Duplicate it for shm and semaphores.
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Davidlohr Bueso <[email protected]>
Cc: Manfred Spraul <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Patch series "sysvipc: ipc-key management improvements".
Here are a few improvements I spotted while eyeballing Guillaume's
rhashtable implementation for ipc keys. The first and fourth patches
are the interesting ones, the middle two are trivial.
This patch (of 4):
The next_id object-allocation functionality was introduced in commit
03f595668017 ("ipc: add sysctl to specify desired next object id").
Given that these new entries are _only_ exported under the
CONFIG_CHECKPOINT_RESTORE option, there is no point for the common case
to even know about ->next_id. As such, rewrite ipc_buildid() so that
it can do away with the field, as well as with unnecessary branches when
adding a new identifier. The end result also better differentiates the
two cases, so the code ends up being cleaner, albeit with some small
duplication for the default case.
[[email protected]: coding-style fixes]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Davidlohr Bueso <[email protected]>
Cc: Manfred Spraul <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
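A hedged sketch of the resulting split (simplified; sequence-number
wraparound handling is omitted and the exact helpers in ipc/util.c may
differ):

	#ifdef CONFIG_CHECKPOINT_RESTORE
	/* Checkpoint/restore build: honour a userspace-requested next_id. */
	static inline int ipc_buildid(int id, struct ipc_ids *ids,
				      struct kern_ipc_perm *new)
	{
		if (ids->next_id < 0) {		/* default behaviour */
			new->seq = ids->seq++;
		} else {			/* restore-requested id */
			new->seq = ipcid_to_seqx(ids->next_id);
			ids->next_id = -1;
		}
		return SEQ_MULTIPLIER * new->seq + id;
	}
	#else
	/* Common case: no ->next_id to even consider. */
	static inline int ipc_buildid(int id, struct ipc_ids *ids,
				      struct kern_ipc_perm *new)
	{
		new->seq = ids->seq++;
		return SEQ_MULTIPLIER * new->seq + id;
	}
	#endif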
|
|
The cpio format uses a 32-bit number to encode file timestamps, which
breaks initramfs support in 2038. This reinterprets the timestamp as
unsigned, to give us another 68 years and avoids breaking until 2106.
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Arnd Bergmann <[email protected]>
Cc: Al Viro <[email protected]>
Cc: Deepa Dinamani <[email protected]>
Cc: Arnd Bergmann <[email protected]>
Cc: Daniel Thompson <[email protected]>
Cc: Lokesh Vutla <[email protected]>
Cc: Stafford Horne <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
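A small userspace demonstration of the arithmetic (hypothetical
example, not the kernel's cpio parser): the 8-hex-digit cpio mtime
field, read as an unsigned 32-bit value into a 64-bit time type, stays
positive up to 2106 instead of wrapping negative in 2038.

	#include <stdint.h>
	#include <stdio.h>
	#include <stdlib.h>

	int main(void)
	{
		const char *mtime = "ffffffff";	/* largest encodable cpio mtime */

		/* signed 32-bit interpretation goes negative after 2038 */
		int32_t as_s32 = (int32_t)strtoul(mtime, NULL, 16);
		/* unsigned value held in a 64-bit type is good until 2106 */
		int64_t as_u32 = (int64_t)(uint32_t)strtoul(mtime, NULL, 16);

		printf("signed:   %d\n", as_s32);		/* -1 */
		printf("unsigned: %lld\n", (long long)as_u32);	/* 4294967295 */
		return 0;
	}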
|
|
Save a bit of cleanup code by leveraging newly added
devm_register_reboot_notifier().
[[email protected]: small cleanup: avoid 80-col tricks]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Andrey Smirnov <[email protected]>
Acked-by: Guenter Roeck <[email protected]>
Cc: Chris Healy <[email protected]>
Cc: Wim Van Sebroeck <[email protected]>
Cc: Andy Shevchenko <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Add devm_* wrapper around register_reboot_notifier to simplify device
specific reboot notifier registration/unregistration.
[[email protected]: move `struct device' forward decl to top-of-file]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Andrey Smirnov <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
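A hedged sketch of the intended usage from a driver probe path (the
driver, handler and device names below are hypothetical):

	/* Hedged sketch: the notifier is unregistered automatically when the
	 * device is unbound, so remove() needs no cleanup code for it. */
	static int foo_reboot_handler(struct notifier_block *nb,
				      unsigned long action, void *data)
	{
		/* quiesce the (hypothetical) hardware before reboot */
		return NOTIFY_DONE;
	}

	static struct notifier_block foo_reboot_nb = {
		.notifier_call = foo_reboot_handler,
	};

	static int foo_probe(struct platform_device *pdev)
	{
		return devm_register_reboot_notifier(&pdev->dev, &foo_reboot_nb);
	}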
|
|
The updated documentation describes the new KCOV mode for collecting
comparison operands.
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Victor Chibotaru <[email protected]>
Signed-off-by: Alexander Potapenko <[email protected]>
Cc: Dmitry Vyukov <[email protected]>
Cc: Andrey Konovalov <[email protected]>
Cc: Mark Rutland <[email protected]>
Cc: Alexander Popov <[email protected]>
Cc: Andrey Ryabinin <[email protected]>
Cc: Kees Cook <[email protected]>
Cc: Vegard Nossum <[email protected]>
Cc: Quentin Casasnovas <[email protected]>
Cc: <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
The flag enables Clang instrumentation of comparison operations
(currently not supported by GCC). This instrumentation is needed by the
new KCOV device to collect comparison operands.
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Victor Chibotaru <[email protected]>
Signed-off-by: Alexander Potapenko <[email protected]>
Cc: Dmitry Vyukov <[email protected]>
Cc: Andrey Konovalov <[email protected]>
Cc: Mark Rutland <[email protected]>
Cc: Alexander Popov <[email protected]>
Cc: Andrey Ryabinin <[email protected]>
Cc: Kees Cook <[email protected]>
Cc: Vegard Nossum <[email protected]>
Cc: Quentin Casasnovas <[email protected]>
Cc: <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Enables kcov to collect comparison operands from instrumented code.
This is done by using Clang's -fsanitize-coverage=trace-cmp
instrumentation (currently not available for GCC).
The comparison operands help a lot in fuzz testing. E.g. they are used
in Syzkaller to cover the interiors of conditional statements with far
fewer attempts and thus make previously unreachable code reachable.
To allow separate collection of coverage and comparison operands, two
different work modes are implemented. Mode selection is now done via a
KCOV_ENABLE ioctl call with the corresponding argument value.
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Victor Chibotaru <[email protected]>
Signed-off-by: Alexander Potapenko <[email protected]>
Cc: Dmitry Vyukov <[email protected]>
Cc: Andrey Konovalov <[email protected]>
Cc: Mark Rutland <[email protected]>
Cc: Alexander Popov <[email protected]>
Cc: Andrey Ryabinin <[email protected]>
Cc: Kees Cook <[email protected]>
Cc: Vegard Nossum <[email protected]>
Cc: Quentin Casasnovas <[email protected]>
Cc: <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
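A hedged userspace sketch of enabling the new mode; the ioctl numbers
and the KCOV_TRACE_CMP value are written from memory of <linux/kcov.h>
and should be treated as assumptions, and the per-record layout of the
comparison data is left out (see the updated documentation):

	#include <fcntl.h>
	#include <stdint.h>
	#include <stdio.h>
	#include <sys/ioctl.h>
	#include <sys/mman.h>
	#include <unistd.h>

	/* Assumed values, normally provided by <linux/kcov.h>. */
	#define KCOV_INIT_TRACE	_IOR('c', 1, unsigned long)
	#define KCOV_ENABLE	_IO('c', 100)
	#define KCOV_DISABLE	_IO('c', 101)
	#define KCOV_TRACE_CMP	1
	#define COVER_SIZE	(64 << 10)	/* 64K 64-bit words */

	int main(void)
	{
		uint64_t *cover;
		int fd = open("/sys/kernel/debug/kcov", O_RDWR);

		if (fd < 0)
			return 1;
		if (ioctl(fd, KCOV_INIT_TRACE, COVER_SIZE))
			return 1;
		cover = mmap(NULL, COVER_SIZE * sizeof(uint64_t),
			     PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
		if (cover == MAP_FAILED)
			return 1;

		/* mode selection: trace comparisons instead of PCs */
		if (ioctl(fd, KCOV_ENABLE, KCOV_TRACE_CMP))
			return 1;

		/* ... exercise some syscalls here ... */

		ioctl(fd, KCOV_DISABLE, 0);
		printf("collected %llu comparison records\n",
		       (unsigned long long)cover[0]);
		return 0;
	}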
|
|
__sanitizer_cov_trace_pc() is hot code, so it's worth removing the
pointless '!current' check: current is never NULL.
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Andrey Ryabinin <[email protected]>
Acked-by: Dmitry Vyukov <[email protected]>
Acked-by: Mark Rutland <[email protected]>
Cc: Andrey Konovalov <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
This is the gist of a patch which we've been forward-porting in our
kernels for a long time now, and it would probably make good sense to
have such a TAINT_AUX flag upstream which each distro etc. can use as
they see fit. This way, we won't need to forward-port a distro-only
version indefinitely.
Add an auxiliary taint flag to be used by distros and others. This
obviates the need to forward-port whatever internal solutions people
have in favor of a single flag which they can map arbitrarily to a
definition of their pleasing.
The "X" mnemonic could also mean eXternal, which would be taint from a
distro or something else but not the upstream kernel. We will use it to
mark modules for which we don't provide support. I.e., a really
eXternal module.
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Borislav Petkov <[email protected]>
Cc: Kees Cook <[email protected]>
Cc: Jessica Yu <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Jiri Slaby <[email protected]>
Cc: Jiri Olsa <[email protected]>
Cc: Michal Marek <[email protected]>
Cc: Jiri Kosina <[email protected]>
Cc: Takashi Iwai <[email protected]>
Cc: Petr Mladek <[email protected]>
Cc: Jeff Mahoney <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
pidhash is no longer required, as all the information can be looked up
from the IDR tree. nr_hashed represented the number of pids that had
been hashed. Since nr_hashed and PIDNS_HASH_ADDING are no longer
relevant, they have been renamed to pid_allocated and PIDNS_ADDING
respectively.
[[email protected]: v6]
Link: http://lkml.kernel.org/r/[email protected]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Gargi Sharma <[email protected]>
Reviewed-by: Rik van Riel <[email protected]>
Tested-by: Tony Luck <[email protected]> [ia64]
Cc: Julia Lawall <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Pavel Tatashin <[email protected]>
Cc: Kirill Tkhai <[email protected]>
Cc: Oleg Nesterov <[email protected]>
Cc: Eric W. Biederman <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Patch series "Replacing PID bitmap implementation with IDR API", v4.
This series replaces kernel bitmap implementation of PID allocation with
IDR API. These patches are written to simplify the kernel by replacing
custom code with calls to generic code.
The following are the stats for pid and pid_namespace object files
before and after the replacement. There is a noteworthy change between
the IDR and bitmap implementation.
Before
text data bss dec hex filename
8447 3894 64 12405 3075 kernel/pid.o
After
text data bss dec hex filename
3397 304 0 3701 e75 kernel/pid.o
Before
text data bss dec hex filename
5692 1842 192 7726 1e2e kernel/pid_namespace.o
After
text data bss dec hex filename
2854 216 16 3086 c0e kernel/pid_namespace.o
The following are the stats for ps, pstree and calling readdir on /proc
for 10,000 processes.
ps:
With IDR API With bitmap
real 0m1.479s 0m2.319s
user 0m0.070s 0m0.060s
sys 0m0.289s 0m0.516s
pstree:
With IDR API With bitmap
real 0m1.024s 0m1.794s
user 0m0.348s 0m0.612s
sys 0m0.184s 0m0.264s
proc:
With IDR API With bitmap
real 0m0.059s 0m0.074s
user 0m0.000s 0m0.004s
sys 0m0.016s 0m0.016s
This patch (of 2):
Replace the current bitmap implementation for Process ID allocation.
Functions that are no longer required, for example, free_pidmap(),
alloc_pidmap(), etc. are removed. The rest of the functions are
modified to use the IDR API. The change was made to make PID
allocation less complex by replacing custom code with calls to the
generic API.
[[email protected]: v6]
Link: http://lkml.kernel.org/r/[email protected]
[[email protected]: restore the old behaviour of the ns_last_pid sysctl]
Link: http://lkml.kernel.org/r/[email protected]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Gargi Sharma <[email protected]>
Reviewed-by: Rik van Riel <[email protected]>
Acked-by: Oleg Nesterov <[email protected]>
Cc: Julia Lawall <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Pavel Tatashin <[email protected]>
Cc: Kirill Tkhai <[email protected]>
Cc: Eric W. Biederman <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
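A hedged, simplified sketch of the central allocation step after the
conversion (single namespace level only; error unwinding and the
ns_last_pid/first-pid details are left out, see kernel/pid.c for the
real alloc_pid()):

	/* Hedged sketch: allocate a pid number from the namespace IDR. */
	static int alloc_pid_nr(struct pid_namespace *ns, struct pid *pid)
	{
		int nr;

		idr_preload(GFP_KERNEL);
		spin_lock_irq(&pidmap_lock);
		/* cyclic allocation keeps pid numbers increasing until they
		 * wrap at pid_max, matching the old bitmap behaviour */
		nr = idr_alloc_cyclic(&ns->idr, pid, RESERVED_PIDS, pid_max,
				      GFP_ATOMIC);
		spin_unlock_irq(&pidmap_lock);
		idr_preload_end();

		return nr;	/* pid number, or a negative errno */
	}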
|
|
Remove an unnecessary else block, and remove a redundant return and a
call to kfree() in an if block.
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ola N. Kaldestad <[email protected]>
Acked-by: Kees Cook <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Link: http://lkml.kernel.org/r/CAKW4uUyCi=PnKf3epgFVz8z=1tMtHSOHNm+fdNxrNw3-THvRCA@mail.gmail.com
Signed-off-by: Kangmin Park <[email protected]>
Cc: Jiri Kosina <[email protected]>
Cc: Alan Cox <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
'rio_dma_transfer()'
In case of error, 'dma_map_sg()' returns 0, not a negative value. There
is BUG_ON() in 'dma_map_sg_attrs()' which makes sure of that.
Link: http://lkml.kernel.org/r/d4235bd2b9274e99f6c86ea71b1fa1c7bd8d0c08.1505687047.git.christophe.jaillet@wanadoo.fr
Signed-off-by: Christophe JAILLET <[email protected]>
Reviewed-by: Logan Gunthorpe <[email protected]>
Cc: Matt Porter <[email protected]>
Cc: Alexandre Bounine <[email protected]>
Cc: Lorenzo Stoakes <[email protected]>
Cc: Jesper Nilsson <[email protected]>
Cc: Christian König <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
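For clarity, a hedged fragment of the return-value check this implies
(the function and error code below are illustrative, not the actual
rio_mport_cdev code):

	/* Hedged sketch: dma_map_sg() returns the number of mapped entries,
	 * or 0 on failure -- never a negative errno. */
	static int map_transfer_sg(struct device *dev, struct scatterlist *sgl,
				   unsigned int sg_count)
	{
		int nents = dma_map_sg(dev, sgl, sg_count, DMA_BIDIRECTIONAL);

		if (nents == 0)		/* 0 means failure, don't test for < 0 */
			return -EFAULT;

		return nents;
	}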
|
|
handling path in 'rio_dma_transfer()'
If 'dma_map_sg()' fails, we should branch to the existing error handling
path to free some resources before returning.
Link: http://lkml.kernel.org/r/61292a4f369229eee03394247385e955027283f8.1505687047.git.christophe.jaillet@wanadoo.fr
Signed-off-by: Christophe JAILLET <[email protected]>
Reviewed-by: Logan Gunthorpe <[email protected]>
Cc: Matt Porter <[email protected]>
Cc: Alexandre Bounine <[email protected]>
Cc: Lorenzo Stoakes <[email protected]>
Cc: Jesper Nilsson <[email protected]>
Cc: Christian König <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
rio_device_id structures are not supposed to change at runtime. The
rio driver works with a const 'id_table', so mark the non-const
rio_device_id structs as const.
Link: http://lkml.kernel.org/r/[email protected]
Link: http://lkml.kernel.org/r/[email protected]
Link: http://lkml.kernel.org/r/[email protected]
Link: http://lkml.kernel.org/r/[email protected]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Arvind Yadav <[email protected]>
Acked-by: Alexandre Bounine <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
parse_crashkernel_mem() silently returns if we get zero bytes in the
parsing function. It is useful for debugging to add a message,
especially if the kernel cannot boot correctly.
Add a pr_info instead of a pr_warn because size = 0 is expected
behavior: e.g. with crashkernel=2G-4G:128M, size will be 0 if system
memory is less than 2G.
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Dave Young <[email protected]>
Cc: Baoquan He <[email protected]>
Cc: Vivek Goyal <[email protected]>
Cc: Bhupesh Sharma <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
complete_signal()
complete_signal() checks SIGNAL_UNKILLABLE before it starts to destroy
the thread group, today this is wrong in many ways.
If nothing else, fatal_signal_pending() should always imply that the
whole thread group (except ->group_exit_task if it is not NULL) is
killed, this check breaks the rule.
After the previous changes we can rely on sig_task_ignored();
sig_fatal(sig) && SIGNAL_UNKILLABLE can only be true if we actually want
to kill this task and sig == SIGKILL OR it is traced and debugger can
intercept the signal.
This should hopefully fix the problem reported by Dmitry. This
test-case
#define _GNU_SOURCE		/* for clone() and CLONE_NEWPID */
#include <assert.h>
#include <sched.h>
#include <signal.h>
#include <sys/ptrace.h>
#include <sys/syscall.h>
#include <sys/wait.h>
#include <unistd.h>

static int init(void *arg)
{
	for (;;)
		pause();
}

int main(void)
{
	char stack[16 * 1024];

	for (;;) {
		int pid = clone(init, stack + sizeof(stack)/2,
				CLONE_NEWPID | SIGCHLD, NULL);
		assert(pid > 0);
		assert(ptrace(PTRACE_ATTACH, pid, 0, 0) == 0);
		assert(waitpid(-1, NULL, WSTOPPED) == pid);
		assert(ptrace(PTRACE_DETACH, pid, 0, SIGSTOP) == 0);
		assert(syscall(__NR_tkill, pid, SIGKILL) == 0);
		assert(pid == wait(NULL));
	}
}
triggers the WARN_ON_ONCE(!(task->jobctl & JOBCTL_STOP_PENDING)) in
task_participate_group_stop(). do_signal_stop()->signal_group_exit()
checks SIGNAL_GROUP_EXIT and return false, but task_set_jobctl_pending()
checks fatal_signal_pending() and does not set JOBCTL_STOP_PENDING.
And this should fix the minor security problem reported by Kyle:
SECCOMP_RET_TRACE can miss fatal_signal_pending() the same way if the
task is the root of a pid namespace.
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Oleg Nesterov <[email protected]>
Reported-by: Dmitry Vyukov <[email protected]>
Reported-by: Kyle Huey <[email protected]>
Reviewed-by: Kees Cook <[email protected]>
Tested-by: Kyle Huey <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
signals
Change sig_task_ignored() to drop the SIG_DFL && !sig_kernel_only()
signals even if force == T. This simplifies the next change and this
matches the same check in get_signal() which will drop these signals
anyway.
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Oleg Nesterov <[email protected]>
Tested-by: Kyle Huey <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
The comment in sig_ignored() says "Tracers may want to know about even
ignored signals", but SIGKILL can not be reported to the debugger and it
is just wrong to return 0 in this case: SIGKILL should only kill the
SIGNAL_UNKILLABLE task if it comes from the parent ns.
Change sig_ignored() to ignore ->ptrace if sig == SIGKILL and rely on
sig_task_ignored().
SIGSTOP coming from within the namespace is not really right either, but
at least the debugger can intercept it, and we can't drop it here because
this would break "gdb -p 1": ptrace_attach() won't work. Perhaps we will
add another ->ptrace check later; we will see.
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Oleg Nesterov <[email protected]>
Tested-by: Kyle Huey <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
The variable slots is being assigned a value of zero that is never
read; slots is updated again a few lines later. Remove this redundant
assignment.
Cleans up the clang warning: Value stored to 'slots' is never read
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Colin Ian King <[email protected]>
Acked-by: OGAWA Hirofumi <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Delete variables 'tree' and 'sb', which are set but never used.
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Christos Gkekas <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
It's never used in nilfs2.
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Jeff Layton <[email protected]>
Signed-off-by: Ryusuke Konishi <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Replace S_IRWXUGO with 0777 because symbolic permissions are considered
harmful:
https://lwn.net/Articles/696229/
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ryusuke Konishi <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Fix the following checkpatch warning:
WARNING: Block comments should align the * on each line
#633: FILE: sufile.c:633:
+/**
+ * nilfs_sufile_truncate_range - truncate range of segment array
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ryusuke Konishi <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
atomic_t variables are currently used to implement reference counters
with the following properties:
- counter is initialized to 1 using atomic_set()
- a resource is freed upon counter reaching zero
- once counter reaches zero, its further
increments aren't allowed
- counter schema uses basic atomic operations
(set, inc, inc_not_zero, dec_and_test, etc.)
Such atomic variables should be converted to a newly provided refcount_t
type and API that prevents accidental counter overflows and underflows.
This is important since overflows and underflows can lead to
use-after-free situation and be exploitable.
The variable nilfs_root.count is used as pure reference counter.
Convert it to refcount_t and fix up the operations.
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Elena Reshetova <[email protected]>
Signed-off-by: Ryusuke Konishi <[email protected]>
Suggested-by: Kees Cook <[email protected]>
Reviewed-by: David Windsor <[email protected]>
Reviewed-by: Hans Liljestrand <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
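A hedged sketch of the atomic_t -> refcount_t pattern for a get/put
pair (the structure and helpers below are illustrative, not the actual
nilfs2 call sites):

	struct foo_root {
		refcount_t count;
		/* ... */
	};

	static void foo_get_root(struct foo_root *root)
	{
		refcount_inc(&root->count);	/* warns instead of overflowing */
	}

	static void foo_put_root(struct foo_root *root)
	{
		if (refcount_dec_and_test(&root->count))
			kfree(root);		/* free only on the final put */
	}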
|
|
There is a race condition between nilfs_dirty_inode() and
nilfs_set_file_dirty().
When a file is opened, nilfs_dirty_inode() is called to update the
access timestamp in the inode. It calls __nilfs_mark_inode_dirty() in a
separate transaction. __nilfs_mark_inode_dirty() caches the ifile
buffer_head in the i_bh field of the inode info structure and marks it
as dirty.
After some data was written to the file in another transaction, the
function nilfs_set_file_dirty() is called, which adds the inode to the
ns_dirty_files list.
Then the segment construction calls nilfs_segctor_collect_dirty_files(),
which goes through the ns_dirty_files list and checks the i_bh field.
If there is a cached buffer_head in i_bh it is not marked as dirty
again.
Since nilfs_dirty_inode() and nilfs_set_file_dirty() use separate
transactions, it is possible that a segment construction that writes out
the ifile occurs in-between the two. If this happens the inode is not
on the ns_dirty_files list, but its ifile block is still marked as dirty
and written out.
In the next segment construction, the data for the file is written out
and nilfs_bmap_propagate() updates the b-tree. Eventually the bmap root
is written into the i_bh block, which is not dirty, because it was
written out in another segment construction.
As a result the bmap update can be lost, which leads to file system
corruption. Either the virtual block address points to an unallocated
DAT block, or the DAT entry will be reused for something different.
The error can remain undetected for a long time. A typical error
message would be one of the "bad btree" errors or a warning that a DAT
entry could not be found.
This bug can be reproduced reliably by a simple benchmark that creates
and overwrites millions of 4k files.
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Andreas Rohner <[email protected]>
Signed-off-by: Ryusuke Konishi <[email protected]>
Tested-by: Andreas Rohner <[email protected]>
Tested-by: Ryusuke Konishi <[email protected]>
Cc: <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
In preparation for unconditionally passing the struct timer_list pointer
to all timer callbacks, switch to using the new timer_setup() and
from_timer() to pass the timer pointer explicitly. This requires adding
a pointer to hold the timer's target task, as the lifetime of sc_task
doesn't appear to match the timer's task.
Link: http://lkml.kernel.org/r/20171016235900.GA102729@beast
Signed-off-by: Kees Cook <[email protected]>
Acked-by: Ryusuke Konishi <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
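A hedged sketch of the generic timer_setup()/from_timer() pattern this
conversion applies (struct and field names are illustrative, not the
nilfs2 segment-constructor ones):

	/* Hedged sketch: the callback receives the timer pointer and recovers
	 * its containing object with from_timer(). */
	struct foo {
		struct timer_list timer;
		bool expired;
	};

	static void foo_timeout(struct timer_list *t)
	{
		struct foo *foo = from_timer(foo, t, timer);

		foo->expired = true;
	}

	static void foo_start(struct foo *foo)
	{
		timer_setup(&foo->timer, foo_timeout, 0);
		mod_timer(&foo->timer, jiffies + HZ);
	}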
|
|
Mikulas noticed in the existing do_proc_douintvec_minmax_conv() and
do_proc_dopipe_max_size_conv() introduced in this patchset, that they
inconsistently handle overflow and min/max range inputs:
For example:
0 ... param->min - 1 ---> ERANGE
param->min ... param->max ---> the value is accepted
param->max + 1 ... 0x100000000L + param->min - 1 ---> ERANGE
0x100000000L + param->min ... 0x100000000L + param->max ---> EINVAL
0x100000000L + param->max + 1, 0x200000000L + param->min - 1 ---> ERANGE
0x200000000L + param->min ... 0x200000000L + param->max ---> EINVAL
0x200000000L + param->max + 1, 0x300000000L + param->min - 1 ---> ERANGE
In do_proc_do*() routines which store values into unsigned int variables
(4 bytes wide for 64-bit builds), first validate that the input unsigned
long value (8 bytes wide for 64-bit builds) will fit inside the smaller
unsigned int variable. Then check that the unsigned int value falls
inside the specified parameter min, max range. Otherwise the unsigned
long -> unsigned int conversion drops leading bits from the input value,
leading to the inconsistent pattern Mikulas documented above.
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Joe Lawrence <[email protected]>
Reported-by: Mikulas Patocka <[email protected]>
Reviewed-by: Mikulas Patocka <[email protected]>
Cc: Al Viro <[email protected]>
Cc: Jens Axboe <[email protected]>
Cc: Michael Kerrisk <[email protected]>
Cc: Randy Dunlap <[email protected]>
Cc: Josh Poimboeuf <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
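A hedged sketch of the validation order described above, written as a
free-standing helper (the name and exact error codes are illustrative;
the real checks live in the do_proc_*_conv() routines):

	/* Hedged sketch: fit-check into unsigned int first, then range-check. */
	static int store_uint_in_range(unsigned long val, unsigned int *dst,
				       unsigned int min, unsigned int max)
	{
		if (val > UINT_MAX)
			return -EINVAL;		/* does not fit in unsigned int */
		if (val < min || val > max)
			return -ERANGE;		/* fits, but outside min..max */

		*dst = val;
		return 0;
	}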
|
|
pipe_max_size is assigned directly via procfs sysctl:
static struct ctl_table fs_table[] = {
...
{
.procname = "pipe-max-size",
.data = &pipe_max_size,
.maxlen = sizeof(int),
.mode = 0644,
.proc_handler = &pipe_proc_fn,
.extra1 = &pipe_min_size,
},
...
int pipe_proc_fn(struct ctl_table *table, int write, void __user *buf,
size_t *lenp, loff_t *ppos)
{
...
ret = proc_dointvec_minmax(table, write, buf, lenp, ppos)
...
and is then rounded in-place a few statements later:
...
pipe_max_size = round_pipe_size(pipe_max_size);
...
This leaves a window of time between initial assignment and rounding
that may be visible to other threads. (For example, one thread sets a
non-rounded value to pipe_max_size while another reads its value.)
Similar reads of pipe_max_size are potentially racy:
pipe.c :: alloc_pipe_info()
pipe.c :: pipe_set_size()
Add a new proc_dopipe_max_size() that consolidates reading the new value
from the user buffer, verifying bounds, and calling round_pipe_size()
with a single assignment to pipe_max_size.
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Joe Lawrence <[email protected]>
Reported-by: Mikulas Patocka <[email protected]>
Reviewed-by: Mikulas Patocka <[email protected]>
Cc: Al Viro <[email protected]>
Cc: Jens Axboe <[email protected]>
Cc: Michael Kerrisk <[email protected]>
Cc: Randy Dunlap <[email protected]>
Cc: Josh Poimboeuf <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
round_pipe_size() contains a right-bit-shift expression which may
overflow, which would cause undefined results in a subsequent
roundup_pow_of_two() call.
static inline unsigned int round_pipe_size(unsigned int size)
{
unsigned long nr_pages;
nr_pages = (size + PAGE_SIZE - 1) >> PAGE_SHIFT;
return roundup_pow_of_two(nr_pages) << PAGE_SHIFT;
}
PAGE_SIZE is defined as (1UL << PAGE_SHIFT), so:
- 4 bytes wide on 32-bit (0 to 0xffffffff)
- 8 bytes wide on 64-bit (0 to 0xffffffffffffffff)
That means that in 32-bit round_pipe_size(), nr_pages may overflow to 0:
size=0x00000000 nr_pages=0x0
size=0x00000001 nr_pages=0x1
size=0xfffff000 nr_pages=0xfffff
size=0xfffff001 nr_pages=0x0 << !
size=0xffffffff nr_pages=0x0 << !
This is bad because roundup_pow_of_two(n) is undefined when n == 0!
64-bit is not a problem, as the unsigned int size is 4 bytes wide
(similar to 32-bit) and the larger, 8-byte-wide unsigned long is
sufficient to hold the largest value of the bit shift expression:
size=0xffffffff nr_pages=100000
Modify round_pipe_size() to return 0 in this case and update its callers
to handle it accordingly.
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Joe Lawrence <[email protected]>
Reported-by: Mikulas Patocka <[email protected]>
Reviewed-by: Mikulas Patocka <[email protected]>
Cc: Al Viro <[email protected]>
Cc: Jens Axboe <[email protected]>
Cc: Michael Kerrisk <[email protected]>
Cc: Randy Dunlap <[email protected]>
Cc: Josh Poimboeuf <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
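A small userspace demonstration of the 32-bit wrap described above
(the arithmetic is forced to 32 bits and PAGE_SHIFT is assumed to be
12):

	#include <stdint.h>
	#include <stdio.h>

	#define PAGE_SHIFT 12	/* assume 4K pages */

	int main(void)
	{
		uint32_t sizes[] = { 0x00000001, 0xfffff000, 0xfffff001, 0xffffffff };
		unsigned int i;

		for (i = 0; i < sizeof(sizes) / sizeof(sizes[0]); i++) {
			/* same expression as round_pipe_size(), but with the
			 * 4-byte unsigned long of a 32-bit kernel */
			uint32_t nr_pages = (sizes[i] + (1u << PAGE_SHIFT) - 1)
					    >> PAGE_SHIFT;
			printf("size=0x%08x nr_pages=0x%x\n", sizes[i], nr_pages);
		}
		return 0;	/* the last two entries print nr_pages=0x0 */
	}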
|
|
Patch series "A few round_pipe_size() and pipe-max-size fixups", v3.
While backporting Michael's "pipe: fix limit handling" patchset to a
distro-kernel, Mikulas noticed that current upstream pipe limit handling
contains a few problems:
1 - procfs signed wrap: echo'ing a large number into
/proc/sys/fs/pipe-max-size and then cat'ing it back out shows a
negative value.
2 - round_pipe_size() nr_pages overflow on 32bit: this would
subsequently try roundup_pow_of_two(0), which is undefined.
3 - visible non-rounded pipe-max-size value: there is no mutual
exclusion or protection between the time pipe_max_size is assigned
a raw value from proc_dointvec_minmax() and when it is rounded.
4 - unsigned long -> unsigned int conversion makes for potential odd
return errors from do_proc_douintvec_minmax_conv() and
do_proc_dopipe_max_size_conv().
This version underwent the same testing as v1:
https://marc.info/?l=linux-kernel&m=150643571406022&w=2
This patch (of 4):
pipe_max_size is defined as an unsigned int:
unsigned int pipe_max_size = 1048576;
but its procfs/sysctl representation is an integer:
static struct ctl_table fs_table[] = {
...
{
.procname = "pipe-max-size",
.data = &pipe_max_size,
.maxlen = sizeof(int),
.mode = 0644,
.proc_handler = &pipe_proc_fn,
.extra1 = &pipe_min_size,
},
...
that is signed:
int pipe_proc_fn(struct ctl_table *table, int write, void __user *buf,
size_t *lenp, loff_t *ppos)
{
...
ret = proc_dointvec_minmax(table, write, buf, lenp, ppos)
This leads to signed results via procfs for large values of pipe_max_size:
% echo 2147483647 >/proc/sys/fs/pipe-max-size
% cat /proc/sys/fs/pipe-max-size
-2147483648
Use unsigned operations on this variable to avoid such negative values.
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Joe Lawrence <[email protected]>
Reported-by: Mikulas Patocka <[email protected]>
Reviewed-by: Mikulas Patocka <[email protected]>
Cc: Michael Kerrisk <[email protected]>
Cc: Randy Dunlap <[email protected]>
Cc: Al Viro <[email protected]>
Cc: Jens Axboe <[email protected]>
Cc: Josh Poimboeuf <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Currently, if the autofs kernel module gets an error when writing to the
pipe which links to the daemon, it marks the whole mountpoint as
catatonic, and it will stop working.
It is possible that the error is transient. This can happen if the
daemon is slow and more than 16 requests queue up. If a subsequent
process tries to queue a request, and is then signalled, the write to
the pipe will return -ERESTARTSYS and autofs will take that as total
failure.
So change the code to treat -ERESTARTSYS and -ENOMEM as transient
failures which only abort the current request, not the whole mountpoint.
It isn't a crash or a data corruption, but having autofs mountpoints
suddenly stop working is rather inconvenient.
Ian said:
: And given the problems with a half dozen (or so) user space applications
: consuming large amounts of CPU under heavy mount and umount activity this
: could happen more easily than we expect.
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: NeilBrown <[email protected]>
Acked-by: Ian Kent <[email protected]>
Cc: <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
init/version.c has nothing to do with modules, so remove
<linux/module.h>.
Instead, include <linux/export.h> for EXPORT_SYMBOL_GPL.
This cuts off a lot of unnecessary header parsing.
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Masahiro Yamada <[email protected]>
Cc: Paul Gortmaker <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
The use of ep_call_nested() in ep_eventpoll_poll(), which is the .poll
routine for an epoll fd, is used to prevent excessively deep epoll
nesting, and to prevent circular paths.
However, we are already preventing these conditions during
EPOLL_CTL_ADD. In terms of too deep epoll chains, we do in fact allow
deep nesting of the epoll fds themselves (deeper than EP_MAX_NESTS),
however we don't allow more than EP_MAX_NESTS when an epoll file
descriptor is actually connected to a wakeup source. Thus, we do not
require the use of ep_call_nested(), since ep_eventpoll_poll(), which is
called via ep_scan_ready_list(), only continues nesting if there are
events available.
Since ep_call_nested() is implemented using a global lock, applications
that make use of nested epoll can see large performance improvements
with this change.
Davidlohr said:
: Improvements are quite obscene actually, such as for the following
: epoll_wait() benchmark with 2 level nesting on a 80 core IvyBridge:
:
: ncpus vanilla dirty delta
: 1 2447092 3028315 +23.75%
: 4 231265 2986954 +1191.57%
: 8 121631 2898796 +2283.27%
: 16 59749 2902056 +4757.07%
: 32 26837 2326314 +8568.30%
: 64 12926 1341281 +10276.61%
:
: (http://linux-scalability.org/epoll/epoll-test.c)
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Jason Baron <[email protected]>
Cc: Davidlohr Bueso <[email protected]>
Cc: Alexander Viro <[email protected]>
Cc: Salman Qazi <[email protected]>
Cc: Hou Tao <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
ep_poll_safewake() is used to wakeup potentially nested epoll file
descriptors. The function uses ep_call_nested() to prevent entering the
same wake up queue more than once, and to prevent excessively deep
wakeup paths (deeper than EP_MAX_NESTS). However, this is not necessary
since we are already preventing these conditions during EPOLL_CTL_ADD.
This saves extra function calls, and avoids taking a global lock during
the ep_call_nested() calls.
I have, however, left ep_call_nested() for the CONFIG_DEBUG_LOCK_ALLOC
case, since ep_call_nested() keeps track of the nesting level, and this
is required by the call to spin_lock_irqsave_nested(). It would be nice
to remove the ep_call_nested() calls for the CONFIG_DEBUG_LOCK_ALLOC
case as well; however, it's not clear how to simply pass the nesting
level through multiple wake_up() levels without more surgery. In any
case, I don't think CONFIG_DEBUG_LOCK_ALLOC is generally used for
production.
This patch also apparently fixes a workload at Google that Salman Qazi
reported, by completely removing the poll_safewake_ncalls->lock from
wakeup paths.
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Jason Baron <[email protected]>
Acked-by: Davidlohr Bueso <[email protected]>
Cc: Alexander Viro <[email protected]>
Cc: Salman Qazi <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
A userspace application can directly trigger the allocations from
eventpoll_epi and eventpoll_pwq slabs. A buggy or malicious application
can consume a significant amount of system memory by triggering such
allocations. Indeed we have seen in production where a buggy
application was leaking the epoll references and causing a burst of
eventpoll_epi and eventpoll_pwq slab allocations. This patch opts the
eventpoll_epi and eventpoll_pwq slabs in to kmemcg charging.
There is a per-user limit (~4% of total memory if no highmem) on these
caches. I think it is too generous particularly in the scenario where
jobs of multiple users are running on the system and the administrator
is reducing cost by overcommitting the memory. This is unaccounted
kernel memory and will not be considered by the oom-killer. I think by
accounting it to kmemcg, for systems with kmem accounting enabled, we
can provide better isolation between jobs of different users.
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Shakeel Butt <[email protected]>
Acked-by: Michal Hocko <[email protected]>
Cc: Alexander Viro <[email protected]>
Cc: Vladimir Davydov <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Greg Thelen <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
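The opt-in itself amounts to adding SLAB_ACCOUNT when the caches are
created; a hedged sketch (function name, flags and struct sizes are
simplified, not a verbatim copy of fs/eventpoll.c):

	/* Hedged sketch: create the epoll caches with kmemcg accounting. */
	static int __init eventpoll_cache_init(void)
	{
		epi_cache = kmem_cache_create("eventpoll_epi",
				sizeof(struct epitem), 0,
				SLAB_HWCACHE_ALIGN | SLAB_PANIC | SLAB_ACCOUNT, NULL);

		pwq_cache = kmem_cache_create("eventpoll_pwq",
				sizeof(struct eppoll_entry), 0,
				SLAB_HWCACHE_ALIGN | SLAB_PANIC | SLAB_ACCOUNT, NULL);

		return 0;
	}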
|