aboutsummaryrefslogtreecommitdiff
AgeCommit message (Collapse)AuthorFilesLines
2017-01-31tools lib bpf: Add bpf_map__pin()Joe Stringer2-0/+23
Add a new API to pin a BPF map to the filesystem. The user can specify the path full path within a BPF filesystem to pin the map. Signed-off-by: Joe Stringer <[email protected]> Cc: Alexei Starovoitov <[email protected]> Cc: Daniel Borkmann <[email protected]> Cc: Wang Nan <[email protected]> Cc: [email protected] Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Arnaldo Carvalho de Melo <[email protected]>
2017-01-31tools lib bpf: Add BPF program pinning APIsJoe Stringer2-0/+123
Add new APIs to pin a BPF program (or specific instances) to the filesystem. The user can specify the path full path within a BPF filesystem to pin the program. bpf_program__pin_instance(prog, path, n) will pin the nth instance of 'prog' to the specified path. bpf_program__pin(prog, path) will create the directory 'path' (if it does not exist) and pin each instance within that directory. For instance, path/0, path/1, path/2. Committer notes: - Add missing headers for mkdir() - Check strdup() for failure - Check snprintf >= size, not >, as == also means truncated, see 'man snprintf', return value. - Conditionally define BPF_FS_MAGIC, as it isn't in magic.h in older systems and we're not yet having a tools/include/uapi/linux/magic.h copy. - Do not include linux/magic.h, not present in older distros. Signed-off-by: Joe Stringer <[email protected]> Cc: Alexei Starovoitov <[email protected]> Cc: Daniel Borkmann <[email protected]> Cc: Wang Nan <[email protected]> Cc: [email protected] Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Arnaldo Carvalho de Melo <[email protected]>
2017-01-31perf callchain: Reference count mapsKrister Johansen3-2/+22
If dso__load_kcore frees all of the existing maps, but one has already been attached to a callchain cursor node, then we can get a SIGSEGV in any function that happens to try to use this invalid cursor. Use the existing map refcount mechanism to forestall cleanup of a map until the cursor iterates past the node. Signed-off-by: Krister Johansen <[email protected]> Tested-by: Arnaldo Carvalho de Melo <[email protected]> Cc: Frederic Weisbecker <[email protected]> Cc: Masami Hiramatsu <[email protected]> Cc: Namhyung Kim <[email protected]> Cc: [email protected] Fixes: 84c2cafa2889 ("perf tools: Reference count struct map") Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Arnaldo Carvalho de Melo <[email protected]>
2017-01-30tools headers: Sync {tools/,}arch/powerpc/include/uapi/asm/kvm.h, ↵Ingo Molnar3-0/+25
{tools/,}arch/x86/include/asm/cpufeatures.h and {tools/,}arch/arm/include/uapi/asm/kvm.h The following upstream headers were updated: - The x86 cpufeatures.h file picked up a couple of new feature entries - The PowerPC and ARM KVM headers picked up new features None of which requires changes to perf tooling, so refresh the tooling copy. Solves these build time warnings: Warning: arch/x86/include/asm/cpufeatures.h differs from kernel Warning: arch/powerpc/include/uapi/asm/kvm.h differs from kernel Warning: arch/arm/include/uapi/asm/kvm.h differs from kernel Signed-off-by: Ingo Molnar <[email protected]> Cc: David Ahern <[email protected]> Cc: Jiri Olsa <[email protected]> Cc: Namhyung Kim <[email protected]> Cc: Peter Zijlstra <[email protected]> Link: http://lkml.kernel.org/r/[email protected] [ resync tools/arch/x86/include/asm/cpufeatures.h ] Signed-off-by: Arnaldo Carvalho de Melo <[email protected]>
2017-01-27perf tools: Propagate perf_config() errorsArnaldo Carvalho de Melo12-23/+62
Previously these were being ignored, sometimes silently. Stop doing that, emitting debug messages and handling the errors. Testing it: $ cat ~/.perfconfig cat: /home/acme/.perfconfig: No such file or directory $ perf stat -e cycles usleep 1 Performance counter stats for 'usleep 1': 938,996 cycles:u 0.003813731 seconds time elapsed $ perf top --stdio Error: You may not have permission to collect system-wide stats. Consider tweaking /proc/sys/kernel/perf_event_paranoid, <SNIP> [ perf record: Captured and wrote 0.019 MB perf.data (7 samples) ] [acme@jouet linux]$ perf report --stdio # To display the perf.data header info, please use --header/--header-only options. # Overhead Command Shared Object Symbol # ........ ....... ................. ......................... 71.77% usleep libc-2.24.so [.] _dl_addr 27.07% usleep ld-2.24.so [.] _dl_next_ld_env_entry 1.13% usleep [kernel.kallsyms] [k] page_fault $ $ touch ~/.perfconfig $ ls -la ~/.perfconfig -rw-rw-r--. 1 acme acme 0 Jan 27 12:14 /home/acme/.perfconfig $ $ perf stat -e instructions usleep 1 Performance counter stats for 'usleep 1': 244,610 instructions:u 0.000805383 seconds time elapsed $ [root@jouet ~]# chown acme.acme ~/.perfconfig [root@jouet ~]# perf stat -e cycles usleep 1 Warning: File /root/.perfconfig not owned by current user or root, ignoring it. Performance counter stats for 'usleep 1': 937,615 cycles 0.000836931 seconds time elapsed # Cc: Adrian Hunter <[email protected]> Cc: David Ahern <[email protected]> Cc: Jiri Olsa <[email protected]> Cc: Namhyung Kim <[email protected]> Cc: Wang Nan <[email protected]> Link: http://lkml.kernel.org/n/[email protected] Signed-off-by: Arnaldo Carvalho de Melo <[email protected]>
2017-01-27perf config: Do not consider an error not to have any perfconfig fileArnaldo Carvalho de Melo1-6/+8
While propagating the errors from perf_config(), which were being completely ignored, everything stopped working for people without a ~/.perfconfig file, because the perf_config_set__init() was considering an error not to have a .perfconfig file, duh, fix it by checking the errno after the failed stat() call. It should also not return an error when it says it is ignoring the file, and also a empty file should not return an error either. Cc: Adrian Hunter <[email protected]> Cc: David Ahern <[email protected]> Cc: Jiri Olsa <[email protected]> Cc: Namhyung Kim <[email protected]> Cc: Taeung Song <[email protected]> Cc: Wang Nan <[email protected]> Fixes: 8beeb00f2c84 ("perf config: Use new perf_config_set__init() to initialize config set") Link: http://lkml.kernel.org/n/[email protected] Signed-off-by: Arnaldo Carvalho de Melo <[email protected]>
2017-01-26tools build: Add tools tree support for 'make -s'Josh Poimboeuf3-3/+25
When doing a kernel build with 'make -s', everything is silenced except the objtool build. That's because the tools tree support for silent builds is some combination of missing and broken. Three changes are needed to fix it: - Makefile: propagate '-s' to the sub-make's MAKEFLAGS variable so the tools Makefiles can see it. - tools/scripts/Makefile.include: fix the tools Makefiles' ability to recognize '-s'. The MAKE_VERSION and MAKEFLAGS checks are copied from the top-level Makefile. This silences the "DESCEND objtool" message. - tools/build/Makefile.build: add support to the tools Build files for recognizing '-s'. Again the MAKE_VERSION and MAKEFLAGS checks are copied from the top-level Makefile. This silences all the object compile/link messages. Reported-and-Tested-by: Peter Zijlstra <[email protected]> Signed-off-by: Josh Poimboeuf <[email protected]> Cc: Jiri Olsa <[email protected]> Cc: Michal Marek <[email protected]> Link: http://lkml.kernel.org/r/e8967562ef640c3ae9a76da4ae0f4e47df737c34.1484799200.git.jpoimboe@redhat.com Signed-off-by: Arnaldo Carvalho de Melo <[email protected]>
2017-01-26perf ftrace: Remove needless code setting default tracerTaeung Song1-4/+1
As a result of commit a3497642c261 ("perf ftrace: Make 'function_graph' be the default tracer") the ftrace.tracer variable can't be NULL but the other code setting default tracer remained. Signed-off-by: Taeung Song <[email protected]> Cc: Jiri Olsa <[email protected]> Cc: Namhyung Kim <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Wang Nan <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Arnaldo Carvalho de Melo <[email protected]>
2017-01-26Merge tag 'perf-core-for-mingo-4.11-20170126' of ↵Ingo Molnar22-96/+600
git://git.kernel.org/pub/scm/linux/kernel/git/acme/linux into perf/core Pull the latest perf/core updates from Arnaldo Carvalho de Melo: New features: - Introduce 'perf ftrace' a perf front end to the kernel's ftrace function and function_graph tracer, defaulting to the "function_graph" tracer, more work will be done in reviving this effort, forward porting it from its initial patch submission (Namhyung Kim) - Add 'e' and 'c' hotkeys to expand/collapse call chains for a single hist entry in the 'perf report' and 'perf top' TUI (Jiri Olsa) Fixes: - Fix wrong register name for arm64, used in 'perf probe' (He Kuang) - Fix map offsets in relocation in libbpf (Joe Stringer) - Fix looking up dwarf unwind stack info (Matija Glavinic Pecotic) Infrastructure changes: - libbpf prog functions sync with what is exported via uapi (Joe Stringer) Trivial changes: - Remove unnecessary checks and assignments in 'perf probe's try_to_find_absolute_address() (Markus Elfring) Signed-off-by: Arnaldo Carvalho de Melo <[email protected]> Signed-off-by: Ingo Molnar <[email protected]>
2017-01-26perf ftrace: Make 'function_graph' be the default tracerArnaldo Carvalho de Melo1-1/+2
So that we can suppress the '-t function_graph' and get a more compact command line: # perf ftrace usleep 123456 | grep raw_spin_lock | sort -k2 -nr | head -5 2) 0.555 us | _raw_spin_lock(); 2) 0.516 us | _raw_spin_lock(); 2) 0.410 us | _raw_spin_lock_irq(); 2) 0.374 us | _raw_spin_lock_irqsave(); # Tested-by: Masami Hiramatsu <[email protected]> Cc: Adrian Hunter <[email protected]> Cc: David Ahern <[email protected]> Cc: Frederic Weisbecker <[email protected]> Cc: Jeremy Eder <[email protected]> Cc: Jiri Olsa <[email protected]> Cc: Namhyung Kim <[email protected]> Cc: Paul Mackerras <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Stephane Eranian <[email protected]> Cc: Steven Rostedt <[email protected]> Cc: Wang Nan <[email protected]> Link: http://lkml.kernel.org/n/[email protected] Signed-off-by: Arnaldo Carvalho de Melo <[email protected]>
2017-01-26perf ftrace: Introduce new 'ftrace' toolNamhyung Kim6-0/+282
The 'perf ftrace' command is a simple wrapper of kernel's ftrace functionality. It only supports single thread tracing currently and just reads trace_pipe in text and then write it to stdout. Committer notes: Testing it: # perf ftrace -f function_graph usleep 123456 <SNIP> 2) | SyS_nanosleep() { 2) | _copy_from_user() { <SNIP> 2) 0.900 us | } 2) 1.354 us | } 2) | hrtimer_nanosleep() { 2) 0.062 us | __hrtimer_init(); 2) | do_nanosleep() { 2) | hrtimer_start_range_ns() { <SNIP> 2) 5.025 us | } 2) | schedule() { 2) 0.125 us | rcu_note_context_switch(); 2) 0.057 us | _raw_spin_lock(); 2) | deactivate_task() { 2) 0.369 us | update_rq_clock.part.77(); 2) | dequeue_task_fair() { <SNIP> 2) + 22.453 us | } 2) + 23.736 us | } 2) | pick_next_task_fair() { <SNIP> 2) + 47.167 us | } 2) | pick_next_task_idle() { <SNIP> 2) 4.462 us | } ------------------------------------------ 2) usleep-20387 => <idle>-0 ------------------------------------------ 2) 0.806 us | switch_mm_irqs_off(); ------------------------------------------ 2) <idle>-0 => usleep-20387 ------------------------------------------ 2) 0.151 us | finish_task_switch(); 2) @ 123597.2 us | } 2) 0.037 us | _cond_resched(); 2) | hrtimer_try_to_cancel() { 2) 0.064 us | hrtimer_active(); 2) 0.353 us | } 2) @ 123605.3 us | } 2) @ 123606.2 us | } 2) @ 123608.3 us | } /* SyS_nanosleep */ 2) | __do_page_fault() { <SNIP> Signed-off-by: Namhyung Kim <[email protected]> Tested-by: Arnaldo Carvalho de Melo <[email protected]> Tested-by: Masami Hiramatsu <[email protected]> Cc: Frederic Weisbecker <[email protected]> Cc: Jeremy Eder <[email protected]> Cc: Jiri Olsa <[email protected]>, Cc: Paul Mackerras <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Stephane Eranian <[email protected]> Cc: Steven Rostedt <[email protected]> Link: http://lkml.kernel.org/n/[email protected] [ Various foward port fixes, add man page ] Signed-off-by: Arnaldo Carvalho de Melo <[email protected]>
2017-01-26perf util: Add more debug message on failure pathNamhyung Kim2-18/+39
It's helpful for debugging on tracing features. Signed-off-by: Namhyung Kim <[email protected]> Tested-by: Masami Hiramatsu <[email protected]> Cc: Frederic Weisbecker <[email protected]> Cc: Jeremy Eder <[email protected]> Cc: Jiri Olsa <[email protected]>, Cc: Paul Mackerras <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Stephane Eranian <[email protected]> Cc: Steven Rostedt <[email protected]> Link: http://lkml.kernel.org/n/[email protected] Signed-off-by: Arnaldo Carvalho de Melo <[email protected]>
2017-01-26perf util: Save pid-cmdline mapping into tracing headerNamhyung Kim4-3/+84
Current trace info data lacks the saved cmdline mapping which is needed for pevent to find out the comm of a task. Add this and bump up the version number so that perf can determine its presence when reading. This is mostly corresponding to trace.dat file version 6, but still lacks 4 byte of number of cpus, and 10 bytes of type string - and I think we don't need those anyway. Signed-off-by: Namhyung Kim <[email protected]> Tested-by: Masami Hiramatsu <[email protected]> Cc: Frederic Weisbecker <[email protected]> Cc: Jeremy Eder <[email protected]> Cc: Jiri Olsa <[email protected]>, Cc: Paul Mackerras <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Stephane Eranian <[email protected]> Cc: Steven Rostedt <[email protected]> [ Change version test from == to >= ] Link: http://lkml.kernel.org/n/[email protected] Signed-off-by: Arnaldo Carvalho de Melo <[email protected]>
2017-01-26perf scripting perl: Do not die() when not founding event for a typeArnaldo Carvalho de Melo1-2/+4
Do just like handling other cases i.e. print some debug message and ignore the sample. Cc: Adrian Hunter <[email protected]> Cc: David Ahern <[email protected]> Cc: Jiri Olsa <[email protected]> Cc: Namhyung Kim <[email protected]> Cc: Wang Nan <[email protected]> Link: http://lkml.kernel.org/n/[email protected] Signed-off-by: Arnaldo Carvalho de Melo <[email protected]>
2017-01-26tools lib bpf: Add libbpf_get_error()Joe Stringer3-2/+12
This function will turn a libbpf pointer into a standard error code (or 0 if the pointer is valid). This also allows removal of the dependency on linux/err.h in the public header file, which causes problems in userspace programs built against libbpf. Signed-off-by: Joe Stringer <[email protected]> Acked-by: Wang Nan <[email protected]> Cc: Alexei Starovoitov <[email protected]> Cc: Daniel Borkmann <[email protected]> Cc: [email protected] Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Arnaldo Carvalho de Melo <[email protected]>
2017-01-26tools lib bpf: Add set/is helpers for all prog typesJoe Stringer2-0/+15
These bpf_prog_types were exposed in the uapi but there were no corresponding functions to set these types for programs in libbpf. Signed-off-by: Joe Stringer <[email protected]> Acked-by: Wang Nan <[email protected]> Cc: Alexei Starovoitov <[email protected]> Cc: Daniel Borkmann <[email protected]> Cc: [email protected] Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Arnaldo Carvalho de Melo <[email protected]>
2017-01-26tools lib bpf: Define prog_type fns with macroJoe Stringer1-25/+16
Turning this into a macro allows future prog types to be added with a single line per type. Signed-off-by: Joe Stringer <[email protected]> Acked-by: Wang Nan <[email protected]> Cc: Alexei Starovoitov <[email protected]> Cc: Daniel Borkmann <[email protected]> Cc: [email protected] Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Arnaldo Carvalho de Melo <[email protected]>
2017-01-26tools lib bpf: Fix map offsets in relocationJoe Stringer1-3/+12
Commit 4708bbda5cb2 ("tools lib bpf: Fix maps resolution") attempted to fix map resolution by identifying the number of symbols that point to maps, and using this number to resolve each of the maps. However, during relocation the original definition of the map size was still in use. For up to two maps, the calculation was correct if there was a small difference in size between the map definition in libbpf and the one that the client library uses. However if the difference was large, particularly if more than two maps were used in the BPF program, the relocation would fail. For example, when using a map definition with size 28, with three maps, map relocation would count: (sym_offset / sizeof(struct bpf_map_def) => map_idx) (0 / 16 => 0), ie map_idx = 0 (28 / 16 => 1), ie map_idx = 1 (56 / 16 => 3), ie map_idx = 3 So, libbpf reports: libbpf: bpf relocation: map_idx 3 large than 2 Fix map relocation by checking the exact offset of maps when doing relocation. Signed-off-by: Joe Stringer <[email protected]> [Allow different map size in an object] Signed-off-by: Wang Nan <[email protected]> Cc: Alexei Starovoitov <[email protected]> Cc: Daniel Borkmann <[email protected]> Cc: [email protected] Fixes: 4708bbda5cb2 ("tools lib bpf: Fix maps resolution") Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Arnaldo Carvalho de Melo <[email protected]>
2017-01-26perf probe: Delete an unnecessary assignment in try_to_find_absolute_address()Markus Elfring1-3/+2
Remove an error code assignment which is redundant in an if branch for the handling of a memory allocation failure because the same value was set for the local variable "err" before. Signed-off-by: Markus Elfring <[email protected]> Acked-by: Masami Hiramatsu <[email protected]> Cc: Adrian Hunter <[email protected]> Cc: Alexander Shishkin <[email protected]> Cc: He Kuang <[email protected]> Cc: Jiri Olsa <[email protected]> Cc: Milian Wolff <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Ravi Bangoria <[email protected]> Cc: Wang Nan <[email protected]> Cc: [email protected] Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Arnaldo Carvalho de Melo <[email protected]>
2017-01-26perf probe: Delete an unnecessary check in try_to_find_absolute_address()Markus Elfring1-4/+2
Remove a condition check which is unnecessary at the end because this source code place should usually only be reached with a non-zero pointer. Signed-off-by: Markus Elfring <[email protected]> Acked-by: Masami Hiramatsu <[email protected]> Cc: Adrian Hunter <[email protected]> Cc: Alexander Shishkin <[email protected]> Cc: He Kuang <[email protected]> Cc: Jiri Olsa <[email protected]> Cc: Milian Wolff <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Ravi Bangoria <[email protected]> Cc: Wang Nan <[email protected]> Cc: [email protected] Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Arnaldo Carvalho de Melo <[email protected]>
2017-01-26perf probe: Fix wrong register name for arm64He Kuang1-6/+6
The register name of arm64 architecture is x0-x31 not r0-r31, this patch changes this typo. Before this patch: # perf probe --definition 'sys_write count' p:probe/sys_write _text+1502872 count=%r2:s64 # echo 'p:probe/sys_write _text+1502872 count=%r2:s64' > \ /sys/kernel/debug/tracing/kprobe_events Parse error at argument[0]. (-22) After this patch: # perf probe --definition 'sys_write count' p:probe/sys_write _text+1502872 count=%x2:s64 # echo 'p:probe/sys_write _text+1502872 count=%x2:s64' > \ /sys/kernel/debug/tracing/kprobe_events # echo 1 >/sys/kernel/debug/tracing/events/probe/enable # cat /sys/kernel/debug/tracing/trace ... sh-422 [000] d... 650.495930: sys_write: (SyS_write+0x0/0xc8) count=22 sh-422 [000] d... 651.102389: sys_write: (SyS_write+0x0/0xc8) count=26 sh-422 [000] d... 651.358653: sys_write: (SyS_write+0x0/0xc8) count=86 Signed-off-by: He Kuang <[email protected]> Acked-by: Masami Hiramatsu <[email protected]> Acked-by: Will Deacon <[email protected]> Cc: Alexander Shishkin <[email protected]> Cc: Bintian Wang <[email protected]> Cc: Jiri Olsa <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Wang Nan <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Arnaldo Carvalho de Melo <[email protected]>
2017-01-25Merge branch 'linus' into perf/core, to pick up fixesIngo Molnar860-5016/+8228
Signed-off-by: Ingo Molnar <[email protected]>
2017-01-24Merge branch 'akpm' (patches from Andrew)Linus Torvalds28-78/+237
Merge fixes from Andrew Morton: "26 fixes" * emailed patches from Andrew Morton <[email protected]>: (26 commits) MAINTAINERS: add Dan Streetman to zbud maintainers MAINTAINERS: add Dan Streetman to zswap maintainers mm: do not export ioremap_page_range symbol for external module mn10300: fix build error of missing fpu_save() romfs: use different way to generate fsid for BLOCK or MTD frv: add missing atomic64 operations mm, page_alloc: fix premature OOM when racing with cpuset mems update mm, page_alloc: move cpuset seqcount checking to slowpath mm, page_alloc: fix fast-path race with cpuset update or removal mm, page_alloc: fix check for NULL preferred_zone kernel/panic.c: add missing \n fbdev: color map copying bounds checking frv: add atomic64_add_unless() mm/mempolicy.c: do not put mempolicy before using its nodemask radix-tree: fix private list warnings Documentation/filesystems/proc.txt: add VmPin mm, memcg: do not retry precharge charges proc: add a schedule point in proc_pid_readdir() mm: alloc_contig: re-allow CMA to compact FS pages mm/slub.c: trace free objects at KERN_INFO ...
2017-01-24MAINTAINERS: add Dan Streetman to zbud maintainersDan Streetman1-0/+1
Add myself as zbud maintainer. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Dan Streetman <[email protected]> Cc: Seth Jennings <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2017-01-24MAINTAINERS: add Dan Streetman to zswap maintainersDan Streetman1-0/+1
Add myself as zswap maintainer. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Dan Streetman <[email protected]> Acked-by: Seth Jennings <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2017-01-24mm: do not export ioremap_page_range symbol for external modulezhong jiang1-1/+0
Recently, I've found cases in which ioremap_page_range was used incorrectly, in external modules, leading to crashes. This can be partly attributed to the fact that ioremap_page_range is lower-level, with fewer protections, as compared to the other functions that an external module would typically call. Those include: ioremap_cache ioremap_nocache ioremap_prot ioremap_uc ioremap_wc ioremap_wt ...each of which wraps __ioremap_caller, which in turn provides a safer way to achieve the mapping. Therefore, stop EXPORT-ing ioremap_page_range. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: zhong jiang <[email protected]> Reviewed-by: John Hubbard <[email protected]> Suggested-by: John Hubbard <[email protected]> Acked-by: Michal Hocko <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2017-01-24mn10300: fix build error of missing fpu_save()Randy Dunlap1-1/+1
When CONFIG_FPU is not enabled on arch/mn10300, <asm/switch_to.h> causes a build error with a call to fpu_save(): kernel/built-in.o: In function `.L410': core.c:(.sched.text+0x28a): undefined reference to `fpu_save' Fix this by including <asm/fpu.h> in <asm/switch_to.h> so that an empty static inline fpu_save() is defined. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Randy Dunlap <[email protected]> Reported-by: kbuild test robot <[email protected]> Reviewed-by: David Howells <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2017-01-24romfs: use different way to generate fsid for BLOCK or MTDColy Li1-1/+22
Commit 8a59f5d25265 ("fs/romfs: return f_fsid for statfs(2)") generates a 64bit id from sb->s_bdev->bd_dev. This is only correct when romfs is defined with CONFIG_ROMFS_ON_BLOCK. If romfs is only defined with CONFIG_ROMFS_ON_MTD, sb->s_bdev is NULL, referencing sb->s_bdev->bd_dev will triger an oops. Richard Weinberger points out that when CONFIG_ROMFS_BACKED_BY_BOTH=y, both CONFIG_ROMFS_ON_BLOCK and CONFIG_ROMFS_ON_MTD are defined. Therefore when calling huge_encode_dev() to generate a 64bit id, I use the follow order to choose parameter, - CONFIG_ROMFS_ON_BLOCK defined use sb->s_bdev->bd_dev - CONFIG_ROMFS_ON_BLOCK undefined and CONFIG_ROMFS_ON_MTD defined use sb->s_dev when, - both CONFIG_ROMFS_ON_BLOCK and CONFIG_ROMFS_ON_MTD undefined leave id as 0 When CONFIG_ROMFS_ON_MTD is defined and sb->s_mtd is not NULL, sb->s_dev is set to a device ID generated by MTD_BLOCK_MAJOR and mtd index, otherwise sb->s_dev is 0. This is a try-best effort to generate a uniq file system ID, if all the above conditions are not meet, f_fsid of this romfs instance will be 0. Generally only one romfs can be built on single MTD block device, this method is enough to identify multiple romfs instances in a computer. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Coly Li <[email protected]> Reported-by: Nong Li <[email protected]> Tested-by: Nong Li <[email protected]> Cc: Richard Weinberger <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2017-01-24frv: add missing atomic64 operationsSudip Mukherjee1-1/+18
Some more atomic64 operations were missing and as a result frv allmodconfig was failing. Add the missing operations. Link: http://lkml.kernel.org/r/1485193844-12850-1-git-send-email-sudip.mukherjee@codethink.co.uk Signed-off-by: Sudip Mukherjee <[email protected]> Cc: David Howells <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2017-01-24mm, page_alloc: fix premature OOM when racing with cpuset mems updateVlastimil Babka1-11/+24
Ganapatrao Kulkarni reported that the LTP test cpuset01 in stress mode triggers OOM killer in few seconds, despite lots of free memory. The test attempts to repeatedly fault in memory in one process in a cpuset, while changing allowed nodes of the cpuset between 0 and 1 in another process. The problem comes from insufficient protection against cpuset changes, which can cause get_page_from_freelist() to consider all zones as non-eligible due to nodemask and/or current->mems_allowed. This was masked in the past by sufficient retries, but since commit 682a3385e773 ("mm, page_alloc: inline the fast path of the zonelist iterator") we fix the preferred_zoneref once, and don't iterate over the whole zonelist in further attempts, thus the only eligible zones might be placed in the zonelist before our starting point and we always miss them. A previous patch fixed this problem for current->mems_allowed. However, cpuset changes also update the task's mempolicy nodemask. The fix has two parts. We have to repeat the preferred_zoneref search when we detect cpuset update by way of seqcount, and we have to check the seqcount before considering OOM. [[email protected]: fix typo in comment] Link: http://lkml.kernel.org/r/[email protected] Fixes: c33d6c06f60f ("mm, page_alloc: avoid looking up the first zone in a zonelist twice") Signed-off-by: Vlastimil Babka <[email protected]> Reported-by: Ganapatrao Kulkarni <[email protected]> Acked-by: Mel Gorman <[email protected]> Acked-by: Hillf Danton <[email protected]> Cc: Michal Hocko <[email protected]> Cc: <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2017-01-24mm, page_alloc: move cpuset seqcount checking to slowpathVlastimil Babka1-21/+26
This is a preparation for the following patch to make review simpler. While the primary motivation is a bug fix, this also simplifies the fast path, although the moved code is only enabled when cpusets are in use. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Vlastimil Babka <[email protected]> Acked-by: Mel Gorman <[email protected]> Acked-by: Hillf Danton <[email protected]> Cc: Ganapatrao Kulkarni <[email protected]> Cc: Michal Hocko <[email protected]> Cc: <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2017-01-24mm, page_alloc: fix fast-path race with cpuset update or removalVlastimil Babka1-1/+9
Ganapatrao Kulkarni reported that the LTP test cpuset01 in stress mode triggers OOM killer in few seconds, despite lots of free memory. The test attempts to repeatedly fault in memory in one process in a cpuset, while changing allowed nodes of the cpuset between 0 and 1 in another process. One possible cause is that in the fast path we find the preferred zoneref according to current mems_allowed, so that it points to the middle of the zonelist, skipping e.g. zones of node 1 completely. If the mems_allowed is updated to contain only node 1, we never reach it in the zonelist, and trigger OOM before checking the cpuset_mems_cookie. This patch fixes the particular case by redoing the preferred zoneref search if we switch back to the original nodemask. The condition is also slightly changed so that when the last non-root cpuset is removed, we don't miss it. Note that this is not a full fix, and more patches will follow. Link: http://lkml.kernel.org/r/[email protected] Fixes: 682a3385e773 ("mm, page_alloc: inline the fast path of the zonelist iterator") Signed-off-by: Vlastimil Babka <[email protected]> Reported-by: Ganapatrao Kulkarni <[email protected]> Acked-by: Michal Hocko <[email protected]> Acked-by: Mel Gorman <[email protected]> Acked-by: Hillf Danton <[email protected]> Cc: <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2017-01-24mm, page_alloc: fix check for NULL preferred_zoneVlastimil Babka2-2/+6
Patch series "fix premature OOM regression in 4.7+ due to cpuset races". This is v2 of my attempt to fix the recent report based on LTP cpuset stress test [1]. The intention is to go to stable 4.9 LTSS with this, as triggering repeated OOMs is not nice. That's why the patches try to be not too intrusive. Unfortunately why investigating I found that modifying the testcase to use per-VMA policies instead of per-task policies will bring the OOM's back, but that seems to be much older and harder to fix problem. I have posted a RFC [2] but I believe that fixing the recent regressions has a higher priority. Longer-term we might try to think how to fix the cpuset mess in a better and less error prone way. I was for example very surprised to learn, that cpuset updates change not only task->mems_allowed, but also nodemask of mempolicies. Until now I expected the parameter to alloc_pages_nodemask() to be stable. I wonder why do we then treat cpusets specially in get_page_from_freelist() and distinguish HARDWALL etc, when there's unconditional intersection between mempolicy and cpuset. I would expect the nodemask adjustment for saving overhead in g_p_f(), but that clearly doesn't happen in the current form. So we have both crazy complexity and overhead, AFAICS. [1] https://lkml.kernel.org/r/CAFpQJXUq-JuEP=QPidy4p_=FN0rkH5Z-kfB4qBvsf6jMS87Edg@mail.gmail.com [2] https://lkml.kernel.org/r/[email protected] This patch (of 4): Since commit c33d6c06f60f ("mm, page_alloc: avoid looking up the first zone in a zonelist twice") we have a wrong check for NULL preferred_zone, which can theoretically happen due to concurrent cpuset modification. We check the zoneref pointer which is never NULL and we should check the zone pointer. Also document this in first_zones_zonelist() comment per Michal Hocko. Fixes: c33d6c06f60f ("mm, page_alloc: avoid looking up the first zone in a zonelist twice") Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Vlastimil Babka <[email protected]> Acked-by: Mel Gorman <[email protected]> Acked-by: Hillf Danton <[email protected]> Cc: Ganapatrao Kulkarni <[email protected]> Cc: Michal Hocko <[email protected]> Cc: <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2017-01-24kernel/panic.c: add missing \nJiri Slaby1-1/+1
When a system panics, the "Rebooting in X seconds.." message is never printed because it lacks a new line. Fix it. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Jiri Slaby <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2017-01-24fbdev: color map copying bounds checkingKees Cook1-12/+14
Copying color maps to userspace doesn't check the value of to->start, which will cause kernel heap buffer OOB read due to signedness wraps. CVE-2016-8405 Link: http://lkml.kernel.org/r/20170105224249.GA50925@beast Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2") Signed-off-by: Kees Cook <[email protected]> Reported-by: Peter Pi (@heisecode) of Trend Micro Cc: Min Chong <[email protected]> Cc: Dan Carpenter <[email protected]> Cc: Tomi Valkeinen <[email protected]> Cc: Bartlomiej Zolnierkiewicz <[email protected]> Cc: <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2017-01-24frv: add atomic64_add_unless()Sudip Mukherjee1-0/+16
The build of frv allmodconfig was failing with the error: lib/atomic64_test.c:209:9: error: implicit declaration of function 'atomic64_add_unless' All the atomic64 operations were defined in frv, but atomic64_add_unless() was not done. Implement atomic64_add_unless() as done in other arches. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Sudip Mukherjee <[email protected]> Cc: David Howells <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2017-01-24mm/mempolicy.c: do not put mempolicy before using its nodemaskVlastimil Babka1-1/+1
Since commit be97a41b291e ("mm/mempolicy.c: merge alloc_hugepage_vma to alloc_pages_vma") alloc_pages_vma() can potentially free a mempolicy by mpol_cond_put() before accessing the embedded nodemask by __alloc_pages_nodemask(). The commit log says it's so "we can use a single exit path within the function" but that's clearly wrong. We can still do that when doing mpol_cond_put() after the allocation attempt. Make sure the mempolicy is not freed prematurely, otherwise __alloc_pages_nodemask() can end up using a bogus nodemask, which could lead e.g. to premature OOM. Fixes: be97a41b291e ("mm/mempolicy.c: merge alloc_hugepage_vma to alloc_pages_vma") Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Vlastimil Babka <[email protected]> Acked-by: Kirill A. Shutemov <[email protected]> Acked-by: Michal Hocko <[email protected]> Acked-by: David Rientjes <[email protected]> Cc: Aneesh Kumar K.V <[email protected]> Cc: Andrea Arcangeli <[email protected]> Cc: <[email protected]> [4.0+] Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2017-01-24radix-tree: fix private list warningsMatthew Wilcox1-1/+1
The newly introduced warning in radix_tree_free_nodes() was testing the wrong variable; it should have been 'old' instead of 'node'. Fixes: ea07b862ac8e ("mm: workingset: fix use-after-free in shadow node shrinker") Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Matthew Wilcox <[email protected]> Signed-off-by: Johannes Weiner <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2017-01-24Documentation/filesystems/proc.txt: add VmPinFabian Frederick1-2/+3
Commit bc3e53f682d9 ("mm: distinguish between mlocked and pinned pages") added VmPin in /proc/<pid>/status. Report that in Documentation/filesystems/proc.txt Also move Umask after Name to keep correct order. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Fabian Frederick <[email protected]> Cc: Christoph Lameter <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2017-01-24mm, memcg: do not retry precharge chargesDavid Rientjes1-2/+2
When memory.move_charge_at_immigrate is enabled and precharges are depleted during move, mem_cgroup_move_charge_pte_range() will attempt to increase the size of the precharge. Prevent precharges from ever looping by setting __GFP_NORETRY. This was probably the intention of the GFP_KERNEL & ~__GFP_NORETRY, which is pointless as written. Fixes: 0029e19ebf84 ("mm: memcontrol: remove explicit OOM parameter in charge path") Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: David Rientjes <[email protected]> Acked-by: Michal Hocko <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Vladimir Davydov <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2017-01-24proc: add a schedule point in proc_pid_readdir()Eric Dumazet1-0/+2
We have seen proc_pid_readdir() invocations holding cpu for more than 50 ms. Add a cond_resched() to be gentle with other tasks. [[email protected]: coding style fix] Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Eric Dumazet <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2017-01-24mm: alloc_contig: re-allow CMA to compact FS pagesLucas Stach1-0/+1
Commit 73e64c51afc5 ("mm, compaction: allow compaction for GFP_NOFS requests") changed compation to skip FS pages if not explicitly allowed to touch them, but missed to update the CMA compact_control. This leads to a very high isolation failure rate, crippling performance of CMA even on a lightly loaded system. Re-allow CMA to compact FS pages by setting the correct GFP flags, restoring CMA behavior and performance to the kernel 4.9 level. Fixes: 73e64c51afc5 (mm, compaction: allow compaction for GFP_NOFS requests) Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Lucas Stach <[email protected]> Acked-by: Michal Hocko <[email protected]> Acked-by: Vlastimil Babka <[email protected]> Cc: Joonsoo Kim <[email protected]> Cc: Mel Gorman <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2017-01-24mm/slub.c: trace free objects at KERN_INFODaniel Thompson1-10/+13
Currently when trace is enabled (e.g. slub_debug=T,kmalloc-128 ) the trace messages are mostly output at KERN_INFO. However the trace code also calls print_section() to hexdump the head of a free object. This is hard coded to use KERN_ERR, meaning the console is deluged with trace messages even if we've asked for quiet. Fix this the obvious way but adding a level parameter to print_section(), allowing calls from the trace code to use the same trace level as other trace messages. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Daniel Thompson <[email protected]> Acked-by: Christoph Lameter <[email protected]> Acked-by: David Rientjes <[email protected]> Cc: Pekka Enberg <[email protected]> Cc: Joonsoo Kim <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2017-01-24userfaultfd: fix SIGBUS resulting from false rwsem wakeupsAndrea Arcangeli1-2/+35
With >=32 CPUs the userfaultfd selftest triggered a graceful but unexpected SIGBUS because VM_FAULT_RETRY was returned by handle_userfault() despite the UFFDIO_COPY wasn't completed. This seems caused by rwsem waking the thread blocked in handle_userfault() and we can't run up_read() before the wait_event sequence is complete. Keeping the wait_even sequence identical to the first one, would require running userfaultfd_must_wait() again to know if the loop should be repeated, and it would also require retaking the rwsem and revalidating the whole vma status. It seems simpler to wait the targeted wakeup so that if false wakeups materialize we still wait for our specific wakeup event, unless of course there are signals or the uffd was released. Debug code collecting the stack trace of the wakeup showed this: $ ./userfaultfd 100 99999 nr_pages: 25600, nr_pages_per_cpu: 800 bounces: 99998, mode: racing ver poll, userfaults: 32 35 90 232 30 138 69 82 34 30 139 40 40 31 20 19 43 13 15 28 27 38 21 43 56 22 1 17 31 8 4 2 bounces: 99997, mode: rnd ver poll, Bus error (core dumped) save_stack_trace+0x2b/0x50 try_to_wake_up+0x2a6/0x580 wake_up_q+0x32/0x70 rwsem_wake+0xe0/0x120 call_rwsem_wake+0x1b/0x30 up_write+0x3b/0x40 vm_mmap_pgoff+0x9c/0xc0 SyS_mmap_pgoff+0x1a9/0x240 SyS_mmap+0x22/0x30 entry_SYSCALL_64_fastpath+0x1f/0xbd 0xffffffffffffffff FAULT_FLAG_ALLOW_RETRY missing 70 CPU: 24 PID: 1054 Comm: userfaultfd Tainted: G W 4.8.0+ #30 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.9.3-0-ge2fc41e-prebuilt.qemu-project.org 04/01/2014 Call Trace: dump_stack+0xb8/0x112 handle_userfault+0x572/0x650 handle_mm_fault+0x12cb/0x1520 __do_page_fault+0x175/0x500 trace_do_page_fault+0x61/0x270 do_async_page_fault+0x19/0x90 async_page_fault+0x25/0x30 This always happens when the main userfault selftest thread is running clone() while glibc runs either mprotect or mmap (both taking mmap_sem down_write()) to allocate the thread stack of the background threads, while locking/userfault threads already run at full throttle and are susceptible to false wakeups that may cause handle_userfault() to return before than expected (which results in graceful SIGBUS at the next attempt). This was reproduced only with >=32 CPUs because the loop to start the thread where clone() is too quick with fewer CPUs, while with 32 CPUs there's already significant activity on ~32 locking and userfault threads when the last background threads are started with clone(). This >=32 CPUs SMP race condition is likely reproducible only with the selftest because of the much heavier userfault load it generates if compared to real apps. We'll have to allow "one more" VM_FAULT_RETRY for the WP support and a patch floating around that provides it also hidden this problem but in reality only is successfully at hiding the problem. False wakeups could still happen again the second time handle_userfault() is invoked, even if it's a so rare race condition that getting false wakeups twice in a row is impossible to reproduce. This full fix is needed for correctness, the only alternative would be to allow VM_FAULT_RETRY to be returned infinitely. With this fix the WP support can stick to a strict "one more" VM_FAULT_RETRY logic (no need of returning it infinite times to avoid the SIGBUS). Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Andrea Arcangeli <[email protected]> Reported-by: Shubham Kumar Sharma <[email protected]> Tested-by: Mike Kravetz <[email protected]> Acked-by: Hillf Danton <[email protected]> Cc: Michael Rapoport <[email protected]> Cc: "Dr. David Alan Gilbert" <[email protected]> Cc: Pavel Emelyanov <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2017-01-24drivers/memstick/core/memstick.c: avoid -Wnonnull warningArnd Bergmann1-1/+1
gcc-7 produces a harmless false-postive warning about a possible NULL pointer access: drivers/memstick/core/memstick.c: In function 'h_memstick_read_dev_id': drivers/memstick/core/memstick.c:309:3: error: argument 2 null where non-null expected [-Werror=nonnull] memcpy(mrq->data, buf, mrq->data_len); This can't happen because the caller sets the command to 'MS_TPC_READ_REG', which causes the data direction to be 'READ' and the NULL pointer not accessed. As a simple workaround for the warning, we can pass a pointer to the data that we actually want to read into. This is not needed here, but also harmless, and lets the compiler know that the access is ok. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Arnd Bergmann <[email protected]> Cc: Alex Dubov <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2017-01-24kernel/watchdog: prevent false hardlockup on overloaded systemDon Zickus3-0/+13
On an overloaded system, it is possible that a change in the watchdog threshold can be delayed long enough to trigger a false positive. This can easily be achieved by having a cpu spinning indefinitely on a task, while another cpu updates watchdog threshold. What happens is while trying to park the watchdog threads, the hrtimers on the other cpus trigger and reprogram themselves with the new slower watchdog threshold. Meanwhile, the nmi watchdog is still programmed with the old faster threshold. Because the one cpu is blocked, it prevents the thread parking on the other cpus from completing, which is needed to shutdown the nmi watchdog and reprogram it correctly. As a result, a false positive from the nmi watchdog is reported. Fix this by setting a park_in_progress flag to block all lockups until the parking is complete. Fix provided by Ulrich Obergfell. [[email protected]: s/park_in_progress/watchdog_park_in_progress/] Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Don Zickus <[email protected]> Reviewed-by: Aaron Tomlin <[email protected]> Cc: Ulrich Obergfell <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2017-01-24dax: fix build warnings with FS_DAX and !FS_IOMAPRoss Zwisler4-4/+1
As reported by Arnd: https://lkml.org/lkml/2017/1/10/756 Compiling with the following configuration: # CONFIG_EXT2_FS is not set # CONFIG_EXT4_FS is not set # CONFIG_XFS_FS is not set # CONFIG_FS_IOMAP depends on the above filesystems, as is not set CONFIG_FS_DAX=y generates build warnings about unused functions in fs/dax.c: fs/dax.c:878:12: warning: `dax_insert_mapping' defined but not used [-Wunused-function] static int dax_insert_mapping(struct address_space *mapping, ^~~~~~~~~~~~~~~~~~ fs/dax.c:572:12: warning: `copy_user_dax' defined but not used [-Wunused-function] static int copy_user_dax(struct block_device *bdev, sector_t sector, size_t size, ^~~~~~~~~~~~~ fs/dax.c:542:12: warning: `dax_load_hole' defined but not used [-Wunused-function] static int dax_load_hole(struct address_space *mapping, void **entry, ^~~~~~~~~~~~~ fs/dax.c:312:14: warning: `grab_mapping_entry' defined but not used [-Wunused-function] static void *grab_mapping_entry(struct address_space *mapping, pgoff_t index, ^~~~~~~~~~~~~~~~~~ Now that the struct buffer_head based DAX fault paths and I/O path have been removed we really depend on iomap support being present for DAX. Make this explicit by selecting FS_IOMAP if we compile in DAX support. This allows us to remove conditional selections of FS_IOMAP when FS_DAX was present for ext2 and ext4, and to remove an #ifdef in fs/dax.c. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ross Zwisler <[email protected]> Reported-by: Arnd Bergmann <[email protected]> Reviewed-by: Christoph Hellwig <[email protected]> Reviewed-by: Jan Kara <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2017-01-24mm/huge_memory.c: respect FOLL_FORCE/FOLL_COW for thpKeno Fischer1-1/+17
In commit 19be0eaffa3a ("mm: remove gup_flags FOLL_WRITE games from __get_user_pages()"), the mm code was changed from unsetting FOLL_WRITE after a COW was resolved to setting the (newly introduced) FOLL_COW instead. Simultaneously, the check in gup.c was updated to still allow writes with FOLL_FORCE set if FOLL_COW had also been set. However, a similar check in huge_memory.c was forgotten. As a result, remote memory writes to ro regions of memory backed by transparent huge pages cause an infinite loop in the kernel (handle_mm_fault sets FOLL_COW and returns 0 causing a retry, but follow_trans_huge_pmd bails out immidiately because `(flags & FOLL_WRITE) && !pmd_write(*pmd)` is true. While in this state the process is stil SIGKILLable, but little else works (e.g. no ptrace attach, no other signals). This is easily reproduced with the following code (assuming thp are set to always): #include <assert.h> #include <fcntl.h> #include <stdint.h> #include <stdio.h> #include <string.h> #include <sys/mman.h> #include <sys/stat.h> #include <sys/types.h> #include <sys/wait.h> #include <unistd.h> #define TEST_SIZE 5 * 1024 * 1024 int main(void) { int status; pid_t child; int fd = open("/proc/self/mem", O_RDWR); void *addr = mmap(NULL, TEST_SIZE, PROT_READ, MAP_ANONYMOUS | MAP_PRIVATE, 0, 0); assert(addr != MAP_FAILED); pid_t parent_pid = getpid(); if ((child = fork()) == 0) { void *addr2 = mmap(NULL, TEST_SIZE, PROT_READ | PROT_WRITE, MAP_ANONYMOUS | MAP_PRIVATE, 0, 0); assert(addr2 != MAP_FAILED); memset(addr2, 'a', TEST_SIZE); pwrite(fd, addr2, TEST_SIZE, (uintptr_t)addr); return 0; } assert(child == waitpid(child, &status, 0)); assert(WIFEXITED(status) && WEXITSTATUS(status) == 0); return 0; } Fix this by updating follow_trans_huge_pmd in huge_memory.c analogously to the update in gup.c in the original commit. The same pattern exists in follow_devmap_pmd. However, we should not be able to reach that check with FOLL_COW set, so add WARN_ONCE to make sure we notice if we ever do. [[email protected]: coding-style fixes] Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Keno Fischer <[email protected]> Acked-by: Kirill A. Shutemov <[email protected]> Cc: Greg Thelen <[email protected]> Cc: Nicholas Piggin <[email protected]> Cc: Willy Tarreau <[email protected]> Cc: Oleg Nesterov <[email protected]> Cc: Kees Cook <[email protected]> Cc: Andy Lutomirski <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2017-01-24memory_hotplug: make zone_can_shift() return a boolean valueYasuaki Ishimatsu3-15/+21
online_{kernel|movable} is used to change the memory zone to ZONE_{NORMAL|MOVABLE} and online the memory. To check that memory zone can be changed, zone_can_shift() is used. Currently the function returns minus integer value, plus integer value and 0. When the function returns minus or plus integer value, it means that the memory zone can be changed to ZONE_{NORNAL|MOVABLE}. But when the function returns 0, there are two meanings. One of the meanings is that the memory zone does not need to be changed. For example, when memory is in ZONE_NORMAL and onlined by online_kernel the memory zone does not need to be changed. Another meaning is that the memory zone cannot be changed. When memory is in ZONE_NORMAL and onlined by online_movable, the memory zone may not be changed to ZONE_MOVALBE due to memory online limitation(see Documentation/memory-hotplug.txt). In this case, memory must not be onlined. The patch changes the return type of zone_can_shift() so that memory online operation fails when memory zone cannot be changed as follows: Before applying patch: # grep -A 35 "Node 2" /proc/zoneinfo Node 2, zone Normal <snip> node_scanned 0 spanned 8388608 present 7864320 managed 7864320 # echo online_movable > memory4097/state # grep -A 35 "Node 2" /proc/zoneinfo Node 2, zone Normal <snip> node_scanned 0 spanned 8388608 present 8388608 managed 8388608 online_movable operation succeeded. But memory is onlined as ZONE_NORMAL, not ZONE_MOVABLE. After applying patch: # grep -A 35 "Node 2" /proc/zoneinfo Node 2, zone Normal <snip> node_scanned 0 spanned 8388608 present 7864320 managed 7864320 # echo online_movable > memory4097/state bash: echo: write error: Invalid argument # grep -A 35 "Node 2" /proc/zoneinfo Node 2, zone Normal <snip> node_scanned 0 spanned 8388608 present 7864320 managed 7864320 online_movable operation failed because of failure of changing the memory zone from ZONE_NORMAL to ZONE_MOVABLE Fixes: df429ac03936 ("memory-hotplug: more general validation of zone during online") Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Yasuaki Ishimatsu <[email protected]> Reviewed-by: Reza Arbab <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2017-01-24Merge tag 'platform-drivers-x86-v4.10-4' of ↵Linus Torvalds5-6/+6
git://git.infradead.org/linux-platform-drivers-x86 Pull x86 platform-driver fixes from Andy Shevchenko: "This is my first pull request since I become a co-maintainer of Platform Drivers x86 subsystem. It's a bit bigger than usual due to material collected for almost two weeks in a row. MAINTAINERS: - Add myself to X86 PLATFORM DRIVERS as a co-maintainer ideapad-laptop: - handle ACPI event 1 intel_mid_powerbtn: - Set IRQ_ONESHOT surface3-wmi: - fix uninitialized symbol - Shut up unused-function warning mlx-platform: - free first dev on error" * tag 'platform-drivers-x86-v4.10-4' of git://git.infradead.org/linux-platform-drivers-x86: MAINTAINERS: Add myself to X86 PLATFORM DRIVERS as a co-maintainer platform/x86: ideapad-laptop: handle ACPI event 1 platform/x86: intel_mid_powerbtn: Set IRQ_ONESHOT platform/x86: surface3-wmi: fix uninitialized symbol platform/x86: surface3-wmi: Shut up unused-function warning platform/x86: mlx-platform: free first dev on error