aboutsummaryrefslogtreecommitdiff
path: root/kernel
AgeCommit message (Collapse)AuthorFilesLines
2016-04-23perf/core: Add ::write_backward attribute to perf eventWang Nan2-6/+58
This patch introduces 'write_backward' bit to perf_event_attr, which controls the direction of a ring buffer. After set, the corresponding ring buffer is written from end to beginning. This feature is design to support reading from overwritable ring buffer. Ring buffer can be created by mapping a perf event fd. Kernel puts event records into ring buffer, user tooling like perf fetch them from address returned by mmap(). To prevent racing between kernel and tooling, they communicate to each other through 'head' and 'tail' pointers. Kernel maintains 'head' pointer, points it to the next free area (tail of the last record). Tooling maintains 'tail' pointer, points it to the tail of last consumed record (record has already been fetched). Kernel determines the available space in a ring buffer using these two pointers to avoid overwrite unfetched records. By mapping without 'PROT_WRITE', an overwritable ring buffer is created. Different from normal ring buffer, tooling is unable to maintain 'tail' pointer because writing is forbidden. Therefore, for this type of ring buffers, kernel overwrite old records unconditionally, works like flight recorder. This feature would be useful if reading from overwritable ring buffer were as easy as reading from normal ring buffer. However, there's an obscure problem. The following figure demonstrates a full overwritable ring buffer. In this figure, the 'head' pointer points to the end of last record, and a long record 'E' is pending. For a normal ring buffer, a 'tail' pointer would have pointed to position (X), so kernel knows there's no more space in the ring buffer. However, for an overwritable ring buffer, kernel ignore the 'tail' pointer. (X) head . | . V +------+-------+----------+------+---+ |A....A|B.....B|C........C|D....D| | +------+-------+----------+------+---+ Record 'A' is overwritten by event 'E': head | V +--+---+-------+----------+------+---+ |.E|..A|B.....B|C........C|D....D|E..| +--+---+-------+----------+------+---+ Now tooling decides to read from this ring buffer. However, none of these two natural positions, 'head' and the start of this ring buffer, are pointing to the head of a record. Even the full ring buffer can be accessed by tooling, it is unable to find a position to start decoding. The first attempt tries to solve this problem AFAIK can be found from [1]. It makes kernel to maintain 'tail' pointer: updates it when ring buffer is half full. However, this approach introduces overhead to fast path. Test result shows a 1% overhead [2]. In addition, this method utilizes no more tham 50% records. Another attempt can be found from [3], which allows putting the size of an event at the end of each record. This approach allows tooling to find records in a backward manner from 'head' pointer by reading size of a record from its tail. However, because of alignment requirement, it needs 8 bytes to record the size of a record, which is a huge waste. Its performance is also not good, because more data need to be written. This approach also introduces some extra branch instructions to fast path. 'write_backward' is a better solution to this problem. Following figure demonstrates the state of the overwritable ring buffer when 'write_backward' is set before overwriting: head | V +---+------+----------+-------+------+ | |D....D|C........C|B.....B|A....A| +---+------+----------+-------+------+ and after overwriting: head | V +---+------+----------+-------+---+--+ |..E|D....D|C........C|B.....B|A..|E.| +---+------+----------+-------+---+--+ In each situation, 'head' points to the beginning of the newest record. From this record, tooling can iterate over the full ring buffer and fetch records one by one. The only limitation that needs to be considered is back-to-back reading. Due to the non-deterministic of user programs, it is impossible to ensure the ring buffer keeps stable during reading. Consider an extreme situation: tooling is scheduled out after reading record 'D', then a burst of events come, eat up the whole ring buffer (one or multiple rounds). When the tooling process comes back, reading after 'D' is incorrect now. To prevent this problem, we need to find a way to ensure the ring buffer is stable during reading. ioctl(PERF_EVENT_IOC_PAUSE_OUTPUT) is suggested because its overhead is lower than ioctl(PERF_EVENT_IOC_ENABLE). By carefully verifying 'header' pointer, reader can avoid pausing the ring-buffer. For example: /* A union of all possible events */ union perf_event event; p = head = perf_mmap__read_head(); while (true) { /* copy header of next event */ fetch(&event.header, p, sizeof(event.header)); /* read 'head' pointer */ head = perf_mmap__read_head(); /* check overwritten: is the header good? */ if (!verify(sizeof(event.header), p, head)) break; /* copy the whole event */ fetch(&event, p, event.header.size); /* read 'head' pointer again */ head = perf_mmap__read_head(); /* is the whole event good? */ if (!verify(event.header.size, p, head)) break; p += event.header.size; } However, the overhead is high because: a) In-place decoding is not safe. Copying-verifying-decoding is required. b) Fetching 'head' pointer requires additional synchronization. (From Alexei Starovoitov: Even when this trick works, pause is needed for more than stability of reading. When we collect the events into overwrite buffer we're waiting for some other trigger (like all cpu utilization spike or just one cpu running and all others are idle) and when it happens the buffer has valuable info from the past. At this point new events are no longer interesting and buffer should be paused, events read and unpaused until next trigger comes.) This patch utilizes event's default overflow_handler introduced previously. perf_event_output_backward() is created as the default overflow handler for backward ring buffers. To avoid extra overhead to fast path, original perf_event_output() becomes __perf_event_output() and marked '__always_inline'. In theory, there's no extra overhead introduced to fast path. Performance testing: Calling 3000000 times of 'close(-1)', use gettimeofday() to check duration. Use 'perf record -o /dev/null -e raw_syscalls:*' to capture system calls. In ns. Testing environment: CPU : Intel(R) Core(TM) i7-4790 CPU @ 3.60GHz Kernel : v4.5.0 MEAN STDVAR BASE 800214.950 2853.083 PRE1 2253846.700 9997.014 PRE2 2257495.540 8516.293 POST 2250896.100 8933.921 Where 'BASE' is pure performance without capturing. 'PRE1' is test result of pure 'v4.5.0' kernel. 'PRE2' is test result before this patch. 'POST' is test result after this patch. See [4] for the detailed experimental setup. Considering the stdvar, this patch doesn't introduce performance overhead to the fast path. [1] http://lkml.iu.edu/hypermail/linux/kernel/1304.1/04584.html [2] http://lkml.iu.edu/hypermail/linux/kernel/1307.1/00535.html [3] http://lkml.iu.edu/hypermail/linux/kernel/1512.0/01265.html [4] http://lkml.kernel.org/g/[email protected] Signed-off-by: Wang Nan <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Acked-by: Alexei Starovoitov <[email protected]> Cc: <[email protected]> Cc: <[email protected]> Cc: Alexander Shishkin <[email protected]> Cc: Arnaldo Carvalho de Melo <[email protected]> Cc: Brendan Gregg <[email protected]> Cc: He Kuang <[email protected]> Cc: Jiri Olsa <[email protected]> Cc: Jiri Olsa <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Masami Hiramatsu <[email protected]> Cc: Namhyung Kim <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Stephane Eranian <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Vince Weaver <[email protected]> Cc: Zefan Li <[email protected]> Link: http://lkml.kernel.org/r/[email protected] [ Fixed the changelog some more. ] Signed-off-by: Ingo Molnar <[email protected]> Signed-off-by: Ingo Molnar <[email protected]>
2016-04-23Merge branch 'perf/urgent' into perf/core, to resolve conflictIngo Molnar3-3/+15
Signed-off-by: Ingo Molnar <[email protected]>
2016-04-23perf/core: Make sysctl_perf_cpu_time_max_percent conform to documentationPeter Zijlstra1-1/+2
Markus reported that 0 should also disable the throttling we per Documentation/sysctl/kernel.txt. Reported-by: Markus Trippelsdorf <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Cc: Alexander Shishkin <[email protected]> Cc: Arnaldo Carvalho de Melo <[email protected]> Cc: Jiri Olsa <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Stephane Eranian <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Vince Weaver <[email protected]> Fixes: 91a612eea9a3 ("perf/core: Fix dynamic interrupt throttle") Signed-off-by: Ingo Molnar <[email protected]>
2016-04-16Merge branch 'locking-urgent-for-linus' of ↵Linus Torvalds1-0/+2
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull locking fixlet from Ingo Molnar: "Fixes a build warning on certain Kconfig combinations" * 'locking-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: locking/lockdep: Fix print_collision() unused warning
2016-04-14/proc/iomem: only expose physical resource addresses to privileged usersLinus Torvalds1-2/+11
In commit c4004b02f8e5b ("x86: remove the kernel code/data/bss resources from /proc/iomem") I was hoping to remove the phyiscal kernel address data from /proc/iomem entirely, but that had to be reverted because some system programs actually use it. This limits all the detailed resource information to properly credentialed users instead. Signed-off-by: Linus Torvalds <[email protected]>
2016-04-13Merge tag 'v4.6-rc3' into perf/core, to refresh the treeIngo Molnar89-1299/+3485
Signed-off-by: Ingo Molnar <[email protected]>
2016-04-04Merge branch 'PAGE_CACHE_SIZE-removal'Linus Torvalds1-4/+4
Merge PAGE_CACHE_SIZE removal patches from Kirill Shutemov: "PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} macros were introduced *long* time ago with promise that one day it will be possible to implement page cache with bigger chunks than PAGE_SIZE. This promise never materialized. And unlikely will. Let's stop pretending that pages in page cache are special. They are not. The first patch with most changes has been done with coccinelle. The second is manual fixups on top. The third patch removes macros definition" [ I was planning to apply this just before rc2, but then I spaced out, so here it is right _after_ rc2 instead. As Kirill suggested as a possibility, I could have decided to only merge the first two patches, and leave the old interfaces for compatibility, but I'd rather get it all done and any out-of-tree modules and patches can trivially do the converstion while still also working with older kernels, so there is little reason to try to maintain the redundant legacy model. - Linus ] * PAGE_CACHE_SIZE-removal: mm: drop PAGE_CACHE_* and page_cache_{get,release} definition mm, fs: remove remaining PAGE_CACHE_* and page_cache_{get,release} usage mm, fs: get rid of PAGE_CACHE_* and page_cache_{get,release} macros
2016-04-04mm, fs: get rid of PAGE_CACHE_* and page_cache_{get,release} macrosKirill A. Shutemov1-4/+4
PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} macros were introduced *long* time ago with promise that one day it will be possible to implement page cache with bigger chunks than PAGE_SIZE. This promise never materialized. And unlikely will. We have many places where PAGE_CACHE_SIZE assumed to be equal to PAGE_SIZE. And it's constant source of confusion on whether PAGE_CACHE_* or PAGE_* constant should be used in a particular case, especially on the border between fs and mm. Global switching to PAGE_CACHE_SIZE != PAGE_SIZE would cause to much breakage to be doable. Let's stop pretending that pages in page cache are special. They are not. The changes are pretty straight-forward: - <foo> << (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>; - <foo> >> (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>; - PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} -> PAGE_{SIZE,SHIFT,MASK,ALIGN}; - page_cache_get() -> get_page(); - page_cache_release() -> put_page(); This patch contains automated changes generated with coccinelle using script below. For some reason, coccinelle doesn't patch header files. I've called spatch for them manually. The only adjustment after coccinelle is revert of changes to PAGE_CAHCE_ALIGN definition: we are going to drop it later. There are few places in the code where coccinelle didn't reach. I'll fix them manually in a separate patch. Comments and documentation also will be addressed with the separate patch. virtual patch @@ expression E; @@ - E << (PAGE_CACHE_SHIFT - PAGE_SHIFT) + E @@ expression E; @@ - E >> (PAGE_CACHE_SHIFT - PAGE_SHIFT) + E @@ @@ - PAGE_CACHE_SHIFT + PAGE_SHIFT @@ @@ - PAGE_CACHE_SIZE + PAGE_SIZE @@ @@ - PAGE_CACHE_MASK + PAGE_MASK @@ expression E; @@ - PAGE_CACHE_ALIGN(E) + PAGE_ALIGN(E) @@ expression E; @@ - page_cache_get(E) + get_page(E) @@ expression E; @@ - page_cache_release(E) + put_page(E) Signed-off-by: Kirill A. Shutemov <[email protected]> Acked-by: Michal Hocko <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2016-04-04locking/lockdep: Fix print_collision() unused warningBorislav Petkov1-0/+2
Fix this: kernel/locking/lockdep.c:2051:13: warning: ‘print_collision’ defined but not used [-Wunused-function] static void print_collision(struct task_struct *curr, ^ Signed-off-by: Borislav Petkov <[email protected]> Cc: Andrew Morton <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Paul E. McKenney <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Thomas Gleixner <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
2016-04-03Merge branch 'perf-urgent-for-linus' of ↵Linus Torvalds1-2/+13
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull perf fixes from Ingo Molnar: "Misc kernel side fixes: - fix event leak - fix AMD PMU driver bug - fix core event handling bug - fix build bug on certain randconfigs Plus misc tooling fixes" * 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: perf/x86/amd/ibs: Fix pmu::stop() nesting perf/core: Don't leak event in the syscall error path perf/core: Fix time tracking bug with multiplexing perf jit: genelf makes assumptions about endian perf hists: Fix determination of a callchain node's childlessness perf tools: Add missing initialization of perf_sample.cpumode in synthesized samples perf tools: Fix build break on powerpc perf/x86: Move events_sysfs_show() outside CPU_SUP_INTEL perf bench: Fix detached tarball building due to missing 'perf bench memcpy' headers perf tests: Fix tarpkg build test error output redirection
2016-04-03Merge branch 'core-urgent-for-linus' of ↵Linus Torvalds4-34/+126
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull core kernel fixes from Ingo Molnar: "This contains the nohz/atomic cleanup/fix for the fetch_or() ugliness you noted during the original nohz pull request, plus there's also misc fixes: - fix liblockdep build bug - fix uapi header build bug - print more lockdep hash collision info to help debug recent reports of hash collisions - update MAINTAINERS email address" * 'core-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: MAINTAINERS: Update my email address locking/lockdep: Print chain_key collision information uapi/linux/stddef.h: Provide __always_inline to userspace headers tools/lib/lockdep: Fix unsupported 'basename -s' in run_tests.sh locking/atomic, sched: Unexport fetch_or() timers/nohz: Convert tick dependency mask to atomic_t locking/atomic: Introduce atomic_fetch_or()
2016-04-01Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netLinus Torvalds1-2/+4
Pull networking fixes from David Miller: 1) Missing device reference in IPSEC input path results in crashes during device unregistration. From Subash Abhinov Kasiviswanathan. 2) Per-queue ISR register writes not being done properly in macb driver, from Cyrille Pitchen. 3) Stats accounting bugs in bcmgenet, from Patri Gynther. 4) Lightweight tunnel's TTL and TOS were swapped in netlink dumps, from Quentin Armitage. 5) SXGBE driver has off-by-one in probe error paths, from Rasmus Villemoes. 6) Fix race in save/swap/delete options in netfilter ipset, from Vishwanath Pai. 7) Ageing time of bridge not set properly when not operating over a switchdev device. Fix from Haishuang Yan. 8) Fix GRO regression wrt nested FOU/GUE based tunnels, from Alexander Duyck. 9) IPV6 UDP code bumps wrong stats, from Eric Dumazet. 10) FEC driver should only access registers that actually exist on the given chipset, fix from Fabio Estevam. * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (73 commits) net: mvneta: fix changing MTU when using per-cpu processing stmmac: fix MDIO settings Revert "stmmac: Fix 'eth0: No PHY found' regression" stmmac: fix TX normal DESC net: mvneta: use cache_line_size() to get cacheline size net: mvpp2: use cache_line_size() to get cacheline size net: mvpp2: fix maybe-uninitialized warning tun, bpf: fix suspicious RCU usage in tun_{attach, detach}_filter net: usb: cdc_ncm: adding Telit LE910 V2 mobile broadband card rtnl: fix msg size calculation in if_nlmsg_size() fec: Do not access unexisting register in Coldfire net: mvneta: replace MVNETA_CPU_D_CACHE_LINE_SIZE with L1_CACHE_BYTES net: mvpp2: replace MVPP2_CPU_D_CACHE_LINE_SIZE with L1_CACHE_BYTES net: dsa: mv88e6xxx: Clear the PDOWN bit on setup net: dsa: mv88e6xxx: Introduce _mv88e6xxx_phy_page_{read, write} bpf: make padding in bpf_tunnel_key explicit ipv6: udp: fix UDP_MIB_IGNOREDMULTI updates bnxt_en: Fix ethtool -a reporting. bnxt_en: Fix typo in bnxt_hwrm_set_pause_common(). bnxt_en: Implement proper firmware message padding. ...
2016-03-31locking/lockdep: Print chain_key collision informationAlfredo Alvarez Fernandez1-2/+77
A sequence of pairs [class_idx -> corresponding chain_key iteration] is printed for both the current held_lock chain and the cached chain. That exposes the two different class_idx sequences that led to that particular hash value. This helps with debugging hash chain collision reports. Signed-off-by: Alfredo Alvarez Fernandez <[email protected]> Acked-by: Peter Zijlstra (Intel) <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: [email protected] Cc: [email protected] Cc: [email protected] Link: http://lkml.kernel.org/r/1459357416-19190-1-git-send-email-alfredoalvarezernandez@gmail.com Signed-off-by: Ingo Molnar <[email protected]>
2016-03-31perf/ring_buffer: Prepare writing into the ring-buffer from the endWang Nan1-6/+36
Convert perf_output_begin() to __perf_output_begin() and make the later function able to write records from the end of the ring-buffer. Following commits will utilize the 'backward' flag. This is the core patch to support writing to the ring-buffer backwards, which will be introduced by upcoming patches to support reading from overwritable ring-buffers. In theory, this patch should not introduce any extra performance overhead since we use always_inline, but it does not hurt to double check that assumption: When CONFIG_OPTIMIZE_INLINING is disabled, the output object is nearly identical to original one. See: http://lkml.kernel.org/g/[email protected] When CONFIG_OPTIMIZE_INLINING is enabled, the resuling object file becomes smaller: $ size kernel/events/ring_buffer.o* text data bss dec hex filename 4641 4 8 4653 122d kernel/events/ring_buffer.o.old 4545 4 8 4557 11cd kernel/events/ring_buffer.o.new Performance testing results: Calling 3000000 times of 'close(-1)', use gettimeofday() to check duration. Use 'perf record -o /dev/null -e raw_syscalls:*' to capture system calls. In ns. Testing environment: CPU : Intel(R) Core(TM) i7-4790 CPU @ 3.60GHz Kernel : v4.5.0 MEAN STDVAR BASE 800214.950 2853.083 PRE 2253846.700 9997.014 POST 2257495.540 8516.293 Where 'BASE' is pure performance without capturing. 'PRE' is test result of pure 'v4.5.0' kernel. 'POST' is test result after this patch. Considering the stdvar, this patch doesn't hurt performance, within noise margin. For testing details, see: http://lkml.kernel.org/g/[email protected] Signed-off-by: Wang Nan <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Cc: <[email protected]> Cc: Alexander Shishkin <[email protected]> Cc: Alexei Starovoitov <[email protected]> Cc: Arnaldo Carvalho de Melo <[email protected]> Cc: Brendan Gregg <[email protected]> Cc: He Kuang <[email protected]> Cc: Jiri Olsa <[email protected]> Cc: Jiri Olsa <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Masami Hiramatsu <[email protected]> Cc: Namhyung Kim <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Stephane Eranian <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Vince Weaver <[email protected]> Cc: Zefan Li <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
2016-03-31perf/core: Set event's default ::overflow_handler()Wang Nan1-6/+8
Set a default event->overflow_handler in perf_event_alloc() so don't need to check event->overflow_handler in __perf_event_overflow(). Following commits can give a different default overflow_handler. Initial idea comes from Peter: http://lkml.kernel.org/r/[email protected] Since the default value of event->overflow_handler is not NULL, existing 'if (!overflow_handler)' checks need to be changed. is_default_overflow_handler() is introduced for this. No extra performance overhead is introduced into the hot path because in the original code we still need to read this handler from memory. A conditional branch is avoided so actually we remove some instructions. Signed-off-by: Wang Nan <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Cc: <[email protected]> Cc: Alexander Shishkin <[email protected]> Cc: Alexei Starovoitov <[email protected]> Cc: Arnaldo Carvalho de Melo <[email protected]> Cc: Brendan Gregg <[email protected]> Cc: He Kuang <[email protected]> Cc: Jiri Olsa <[email protected]> Cc: Jiri Olsa <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Masami Hiramatsu <[email protected]> Cc: Namhyung Kim <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Stephane Eranian <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Vince Weaver <[email protected]> Cc: Zefan Li <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
2016-03-31perf/ring_buffer: Introduce new ioctl options to pause and resume the ↵Wang Nan3-1/+33
ring-buffer Add new ioctl() to pause/resume ring-buffer output. In some situations we want to read from the ring-buffer only when we ensure nothing can write to the ring-buffer during reading. Without this patch we have to turn off all events attached to this ring-buffer to achieve this. This patch is a prerequisite to enable overwrite support for the perf ring-buffer support. Following commits will introduce new methods support reading from overwrite ring buffer. Before reading, caller must ensure the ring buffer is frozen, or the reading is unreliable. Signed-off-by: Wang Nan <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Cc: <[email protected]> Cc: Alexander Shishkin <[email protected]> Cc: Alexei Starovoitov <[email protected]> Cc: Arnaldo Carvalho de Melo <[email protected]> Cc: Brendan Gregg <[email protected]> Cc: He Kuang <[email protected]> Cc: Jiri Olsa <[email protected]> Cc: Jiri Olsa <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Masami Hiramatsu <[email protected]> Cc: Namhyung Kim <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Stephane Eranian <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Vince Weaver <[email protected]> Cc: Zefan Li <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
2016-03-31ftrace/perf: Check sample types only for sampling eventsJiri Olsa1-0/+3
Currently we check sample type for ftrace:function events even if it's not created as a sampling event. That prevents creating ftrace_function event in counting mode. Make sure we check sample types only for sampling events. Before: $ sudo perf stat -e ftrace:function ls ... Performance counter stats for 'ls': <not supported> ftrace:function 0.001983662 seconds time elapsed After: $ sudo perf stat -e ftrace:function ls ... Performance counter stats for 'ls': 44,498 ftrace:function 0.037534722 seconds time elapsed Suggested-by: Namhyung Kim <[email protected]> Signed-off-by: Jiri Olsa <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Acked-by: Steven Rostedt <[email protected]> Cc: Alexander Shishkin <[email protected]> Cc: Arnaldo Carvalho de Melo <[email protected]> Cc: Arnaldo Carvalho de Melo <[email protected]> Cc: Jiri Olsa <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Stephane Eranian <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Vince Weaver <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
2016-03-31perf/ring_buffer: Document AUX API usageAlexander Shishkin1-0/+10
In order to ensure safe AUX buffer management, we rely on the assumption that pmu::stop() stops its ongoing AUX transaction and not just the hw. This patch documents this requirement for the perf_aux_output_{begin,end}() APIs. Signed-off-by: Alexander Shishkin <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Cc: Arnaldo Carvalho de Melo <[email protected]> Cc: Arnaldo Carvalho de Melo <[email protected]> Cc: Jiri Olsa <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Mathieu Poirier <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Stephane Eranian <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Vince Weaver <[email protected]> Cc: [email protected] Link: http://lkml.kernel.org/r/1457098969-21595-4-git-send-email-alexander.shishkin@linux.intel.com Signed-off-by: Ingo Molnar <[email protected]>
2016-03-31perf/core: Free AUX pages in unmap pathAlexander Shishkin3-29/+129
Now that we can ensure that when ring buffer's AUX area is on the way to getting unmapped new transactions won't start, we only need to stop all events that can potentially be writing aux data to our ring buffer. Having done that, we can safely free the AUX pages and corresponding PMU data, as this time it is guaranteed to be the last aux reference holder. This partially reverts: 57ffc5ca679 ("perf: Fix AUX buffer refcounting") ... which was made to defer deallocation that was otherwise possible from an NMI context. Now it is no longer the case; the last call to rb_free_aux() that drops the last AUX reference has to happen in perf_mmap_close() on that AUX area. Signed-off-by: Alexander Shishkin <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Cc: Arnaldo Carvalho de Melo <[email protected]> Cc: Arnaldo Carvalho de Melo <[email protected]> Cc: Jiri Olsa <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Stephane Eranian <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Vince Weaver <[email protected]> Cc: [email protected] Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
2016-03-31perf/ring_buffer: Refuse to begin AUX transaction after rb->aux_mmap_count dropsAlexander Shishkin1-0/+7
When ring buffer's AUX area is unmapped and rb->aux_mmap_count drops to zero, new AUX transactions into this buffer can still be started, even though the buffer in en route to deallocation. This patch adds a check to perf_aux_output_begin() for rb->aux_mmap_count being zero, in which case there is no point starting new transactions, in other words, the ring buffers that pass a certain point in perf_mmap_close will not have their events sending new data, which clears path for freeing those buffers' pages right there and then, provided that no active transactions are holding the AUX reference. Signed-off-by: Alexander Shishkin <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Cc: Arnaldo Carvalho de Melo <[email protected]> Cc: Arnaldo Carvalho de Melo <[email protected]> Cc: Jiri Olsa <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Stephane Eranian <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Vince Weaver <[email protected]> Cc: [email protected] Link: http://lkml.kernel.org/r/1457098969-21595-2-git-send-email-alexander.shishkin@linux.intel.com Signed-off-by: Ingo Molnar <[email protected]>
2016-03-31perf/core: Verify we have a single perf_hw_context PMUPeter Zijlstra1-0/+9
There should (and can) only be a single PMU for perf_hw_context events. This is because of how we schedule events: once a hardware event fails to schedule (the PMU is 'full') we stop trying to add more. The trivial 'fix' would break the Round-Robin scheduling we do. Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Cc: Alexander Shishkin <[email protected]> Cc: Arnaldo Carvalho de Melo <[email protected]> Cc: Jiri Olsa <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Stephane Eranian <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Vince Weaver <[email protected]> Signed-off-by: Ingo Molnar <[email protected]>
2016-03-31perf/core: Don't leak event in the syscall error pathAlexander Shishkin1-0/+1
In the error path, event_file not being NULL is used to determine whether the event itself still needs to be free'd, so fix it up to avoid leaking. Reported-by: Leon Yu <[email protected]> Signed-off-by: Alexander Shishkin <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Cc: Arnaldo Carvalho de Melo <[email protected]> Cc: Arnaldo Carvalho de Melo <[email protected]> Cc: Jiri Olsa <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Stephane Eranian <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Vince Weaver <[email protected]> Fixes: 130056275ade ("perf: Do not double free") Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
2016-03-31perf/core: Fix time tracking bug with multiplexingPeter Zijlstra1-2/+12
Stephane reported that commit: 3cbaa5906967 ("perf: Fix ctx time tracking by introducing EVENT_TIME") introduced a regression wrt. time tracking, as easily observed by: > This patch introduce a bug in the time tracking of events when > multiplexing is used. > > The issue is easily reproducible with the following perf run: > > $ perf stat -a -C 0 -e branches,branches,branches,branches,branches,branches -I 1000 > 1.000730239 652,394 branches (66.41%) > 1.000730239 597,809 branches (66.41%) > 1.000730239 593,870 branches (66.63%) > 1.000730239 651,440 branches (67.03%) > 1.000730239 656,725 branches (66.96%) > 1.000730239 <not counted> branches > > One branches event is shown as not having run. Yet, with > multiplexing, all events should run especially with a 1s (-I 1000) > interval. The delta for time_running comes out to 0. Yet, the event > has run because the kernel is actually multiplexing the events. The > problem is that the time tracking is the kernel and especially in > ctx_sched_out() is wrong now. > > The problem is that in case that the kernel enters ctx_sched_out() with the > following state: > ctx->is_active=0x7 event_type=0x1 > Call Trace: > [<ffffffff813ddd41>] dump_stack+0x63/0x82 > [<ffffffff81182bdc>] ctx_sched_out+0x2bc/0x2d0 > [<ffffffff81183896>] perf_mux_hrtimer_handler+0xf6/0x2c0 > [<ffffffff811837a0>] ? __perf_install_in_context+0x130/0x130 > [<ffffffff810f5818>] __hrtimer_run_queues+0xf8/0x2f0 > [<ffffffff810f6097>] hrtimer_interrupt+0xb7/0x1d0 > [<ffffffff810509a8>] local_apic_timer_interrupt+0x38/0x60 > [<ffffffff8175ca9d>] smp_apic_timer_interrupt+0x3d/0x50 > [<ffffffff8175ac7c>] apic_timer_interrupt+0x8c/0xa0 > > In that case, the test: > if (is_active & EVENT_TIME) > > will be false and the time will not be updated. Time must always be updated on > sched out. Fix this by always updating time if EVENT_TIME was set, as opposed to only updating time when EVENT_TIME changed. Reported-by: Stephane Eranian <[email protected]> Tested-by: Stephane Eranian <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Cc: Alexander Shishkin <[email protected]> Cc: Arnaldo Carvalho de Melo <[email protected]> Cc: Jiri Olsa <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Vince Weaver <[email protected]> Cc: [email protected] Cc: [email protected] Fixes: 3cbaa5906967 ("perf: Fix ctx time tracking by introducing EVENT_TIME") Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
2016-03-29locking/atomic, sched: Unexport fetch_or()Frederic Weisbecker1-0/+18
This patch functionally reverts: 5fd7a09cfb8c ("atomic: Export fetch_or()") During the merge Linus observed that the generic version of fetch_or() was messy: " This makes the ugly "fetch_or()" macro that the scheduler used internally a new generic helper, and does a bad job at it. " e23604edac2a Merge branch 'timers-nohz-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Now that we have introduced atomic_fetch_or(), fetch_or() is only used by the scheduler in order to deal with thread_info flags which type can vary across architectures. Lets confine fetch_or() back to the scheduler so that we encourage future users to use the more robust and well typed atomic_t version instead. While at it, fetch_or() gets robustified, pasting improvements from a previous patch by Ingo Molnar that avoids needless expression re-evaluations in the loop. Reported-by: Linus Torvalds <[email protected]> Signed-off-by: Frederic Weisbecker <[email protected]> Cc: Andrew Morton <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Thomas Gleixner <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
2016-03-29timers/nohz: Convert tick dependency mask to atomic_tFrederic Weisbecker2-32/+31
The tick dependency mask was intially unsigned long because this is the type on which clear_bit() operates on and fetch_or() accepts it. But now that we have atomic_fetch_or(), we can instead use atomic_andnot() to clear the bit. This consolidates the type of our tick dependency mask, reduce its size on structures and benefit from possible architecture optimizations on atomic_t operations. Suggested-by: Linus Torvalds <[email protected]> Signed-off-by: Frederic Weisbecker <[email protected]> Cc: Andrew Morton <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Thomas Gleixner <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
2016-03-25arch, ftrace: for KASAN put hard/soft IRQ entries into separate sectionsAlexander Potapenko2-1/+2
KASAN needs to know whether the allocation happens in an IRQ handler. This lets us strip everything below the IRQ entry point to reduce the number of unique stack traces needed to be stored. Move the definition of __irq_entry to <linux/interrupt.h> so that the users don't need to pull in <linux/ftrace.h>. Also introduce the __softirq_entry macro which is similar to __irq_entry, but puts the corresponding functions to the .softirqentry.text section. Signed-off-by: Alexander Potapenko <[email protected]> Acked-by: Steven Rostedt <[email protected]> Cc: Christoph Lameter <[email protected]> Cc: Pekka Enberg <[email protected]> Cc: David Rientjes <[email protected]> Cc: Joonsoo Kim <[email protected]> Cc: Andrey Konovalov <[email protected]> Cc: Dmitry Vyukov <[email protected]> Cc: Andrey Ryabinin <[email protected]> Cc: Konstantin Serebryany <[email protected]> Cc: Dmitry Chernenkov <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2016-03-25oom: clear TIF_MEMDIE after oom_reaper managed to unmap the address spaceMichal Hocko1-1/+1
When oom_reaper manages to unmap all the eligible vmas there shouldn't be much of the freable memory held by the oom victim left anymore so it makes sense to clear the TIF_MEMDIE flag for the victim and allow the OOM killer to select another task. The lack of TIF_MEMDIE also means that the victim cannot access memory reserves anymore but that shouldn't be a problem because it would get the access again if it needs to allocate and hits the OOM killer again due to the fatal_signal_pending resp. PF_EXITING check. We can safely hide the task from the OOM killer because it is clearly not a good candidate anymore as everyhing reclaimable has been torn down already. This patch will allow to cap the time an OOM victim can keep TIF_MEMDIE and thus hold off further global OOM killer actions granted the oom reaper is able to take mmap_sem for the associated mm struct. This is not guaranteed now but further steps should make sure that mmap_sem for write should be blocked killable which will help to reduce such a lock contention. This is not done by this patch. Note that exit_oom_victim might be called on a remote task from __oom_reap_task now so we have to check and clear the flag atomically otherwise we might race and underflow oom_victims or wake up waiters too early. Signed-off-by: Michal Hocko <[email protected]> Suggested-by: Johannes Weiner <[email protected]> Suggested-by: Tetsuo Handa <[email protected]> Cc: Andrea Argangeli <[email protected]> Cc: David Rientjes <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Mel Gorman <[email protected]> Cc: Oleg Nesterov <[email protected]> Cc: Rik van Riel <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2016-03-25sched: add schedule_timeout_idle()Andrew Morton1-0/+11
This will be needed in the patch "mm, oom: introduce oom reaper". Acked-by: Michal Hocko <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Peter Zijlstra <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2016-03-25bpf: add missing map_flags to bpf_map_show_fdinfoDaniel Borkmann1-2/+4
Add map_flags attribute to bpf_map_show_fdinfo(), so that tools like tc can check for them when loading objects from a pinned entry, e.g. if user intent wrt allocation (BPF_F_NO_PREALLOC) is different to the pinned object, it can bail out. Follow-up to 6c9059817432 ("bpf: pre-allocate hash map elements"), so that tc can still support this with v4.6. Signed-off-by: Daniel Borkmann <[email protected]> Acked-by: Alexei Starovoitov <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2016-03-24Merge tag 'pm+acpi-4.6-rc1-2' of ↵Linus Torvalds1-0/+1
git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm Pull more power management and ACPI updates from Rafael Wysocki: "The second batch of power management and ACPI updates for v4.6. Included are fixups on top of the previous PM/ACPI pull request and other material that didn't make into it but still should go into 4.6. Among other things, there's a fix for an intel_pstate driver issue uncovered by recent cpufreq changes, a workaround for a boot hang on Skylake-H related to the handling of deep C-states by the platform and a PCI/ACPI fix for the handling of IO port resources on non-x86 architectures plus some new device IDs and similar. Specifics: - Fix for an intel_pstate driver issue related to the handling of MSR updates uncovered by the recent cpufreq rework (Rafael Wysocki). - cpufreq core cleanups related to starting governors and frequency synchronization during resume from system suspend and a locking fix for cpufreq_quick_get() (Rafael Wysocki, Richard Cochran). - acpi-cpufreq and powernv cpufreq driver updates (Jisheng Zhang, Michael Neuling, Richard Cochran, Shilpasri Bhat). - intel_idle driver update preventing some Skylake-H systems from hanging during initialization by disabling deep C-states mishandled by the platform in the problematic configurations (Len Brown). - Intel Xeon Phi Processor x200 support for intel_idle (Dasaratharaman Chandramouli). - cpuidle menu governor updates to make it always honor PM QoS latency constraints (and prevent C1 from being used as the fallback C-state on x86 when they are set below its exit latency) and to restore the previous behavior to fall back to C1 if the next timer event is set far enough in the future that was changed in 4.4 which led to an energy consumption regression (Rik van Riel, Rafael Wysocki). - New device ID for a future AMD UART controller in the ACPI driver for AMD SoCs (Wang Hongcheng). - Rockchip rk3399 support for the rockchip-io-domain adaptive voltage scaling (AVS) driver (David Wu). - ACPI PCI resources management fix for the handling of IO space resources on architectures where the IO space is memory mapped (IA64 and ARM64) broken by the introduction of common ACPI resources parsing for PCI host bridges in 4.4 (Lorenzo Pieralisi). - Fix for the ACPI backend of the generic device properties API to make it parse non-device (data node only) children of an ACPI device correctly (Irina Tirdea). - Fixes for the handling of global suspend flags (introduced in 4.4) during hibernation and resume from it (Lukas Wunner). - Support for obtaining configuration information from Device Trees in the PM clocks framework (Jon Hunter). - ACPI _DSM helper code and devfreq framework cleanups (Colin Ian King, Geert Uytterhoeven)" * tag 'pm+acpi-4.6-rc1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (23 commits) PM / AVS: rockchip-io: add io selectors and supplies for rk3399 intel_idle: Support for Intel Xeon Phi Processor x200 Product Family intel_idle: prevent SKL-H boot failure when C8+C9+C10 enabled ACPI / PM: Runtime resume devices when waking from hibernate PM / sleep: Clear pm_suspend_global_flags upon hibernate cpufreq: governor: Always schedule work on the CPU running update cpufreq: Always update current frequency before startig governor cpufreq: Introduce cpufreq_update_current_freq() cpufreq: Introduce cpufreq_start_governor() cpufreq: powernv: Add sysfs attributes to show throttle stats cpufreq: acpi-cpufreq: make Intel/AMD MSR access, io port access static PCI: ACPI: IA64: fix IO port generic range check ACPI / util: cast data to u64 before shifting to fix sign extension cpufreq: powernv: Define per_cpu chip pointer to optimize hot-path cpuidle: menu: Fall back to polling if next timer event is near cpufreq: acpi-cpufreq: Clean up hot plug notifier callback intel_pstate: Do not call wrmsrl_on_cpu() with disabled interrupts cpufreq: Make cpufreq_quick_get() safe to call ACPI / property: fix data node parsing in acpi_get_next_subnode() ACPI / APD: Add device HID for future AMD UART controller ...
2016-03-25Merge branches 'pm-avs', 'pm-clk', 'pm-devfreq' and 'pm-sleep'Rafael J. Wysocki1-0/+1
* pm-avs: PM / AVS: rockchip-io: add io selectors and supplies for rk3399 * pm-clk: PM / clk: Add support for obtaining clocks from device-tree * pm-devfreq: PM / devfreq: Spelling s/frequnecy/frequency/ * pm-sleep: ACPI / PM: Runtime resume devices when waking from hibernate PM / sleep: Clear pm_suspend_global_flags upon hibernate
2016-03-24Merge tag 'trace-v4.6' of ↵Linus Torvalds11-111/+222
git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace Pull tracing updates from Steven Rostedt: "Nothing major this round. Mostly small clean ups and fixes. Some visible changes: - A new flag was added to distinguish traces done in NMI context. - Preempt tracer now shows functions where preemption is disabled but interrupts are still enabled. Other notes: - Updates were done to function tracing to allow better performance with perf. - Infrastructure code has been added to allow for a new histogram feature for recording live trace event histograms that can be configured by simple user commands. The feature itself was just finished, but needs a round in linux-next before being pulled. This only includes some infrastructure changes that will be needed" * tag 'trace-v4.6' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace: (22 commits) tracing: Record and show NMI state tracing: Fix trace_printk() to print when not using bprintk() tracing: Remove redundant reset per-CPU buff in irqsoff tracer x86: ftrace: Fix the misleading comment for arch/x86/kernel/ftrace.c tracing: Fix crash from reading trace_pipe with sendfile tracing: Have preempt(irqs)off trace preempt disabled functions tracing: Fix return while holding a lock in register_tracer() ftrace: Use kasprintf() in ftrace_profile_tracefs() ftrace: Update dynamic ftrace calls only if necessary ftrace: Make ftrace_hash_rec_enable return update bool tracing: Fix typoes in code comment and printk in trace_nop.c tracing, writeback: Replace cgroup path to cgroup ino tracing: Use flags instead of bool in trigger structure tracing: Add an unreg_all() callback to trigger commands tracing: Add needs_rec flag to event triggers tracing: Add a per-event-trigger 'paused' field tracing: Add get_syscall_name() tracing: Add event record param to trigger_ops.func() tracing: Make event trigger functions available tracing: Make ftrace_event_field checking functions available ...
2016-03-24Merge branch 'perf-urgent-for-linus' of ↵Linus Torvalds2-38/+82
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull perf fixes from Ingo Molnar: "This tree contains various perf fixes on the kernel side, plus three hw/event-enablement late additions: - Intel Memory Bandwidth Monitoring events and handling - the AMD Accumulated Power Mechanism reporting facility - more IOMMU events ... and a final round of perf tooling updates/fixes" * 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (44 commits) perf llvm: Use strerror_r instead of the thread unsafe strerror one perf llvm: Use realpath to canonicalize paths perf tools: Unexport some methods unused outside strbuf.c perf probe: No need to use formatting strbuf method perf help: Use asprintf instead of adhoc equivalents perf tools: Remove unused perf_pathdup, xstrdup functions perf tools: Do not include stringify.h from the kernel sources tools include: Copy linux/stringify.h from the kernel tools lib traceevent: Remove redundant CPU output perf tools: Remove needless 'extern' from function prototypes perf tools: Simplify die() mechanism perf tools: Remove unused DIE_IF macro perf script: Remove lots of unused arguments perf thread: Rename perf_event__preprocess_sample_addr to thread__resolve perf machine: Rename perf_event__preprocess_sample to machine__resolve perf tools: Add cpumode to struct perf_sample perf tests: Forward the perf_sample in the dwarf unwind test perf tools: Remove misplaced __maybe_unused perf list: Fix documentation of :ppp perf bench numa: Fix assertion for nodes bitfield ...
2016-03-24Merge branch 'sched-urgent-for-linus' of ↵Linus Torvalds5-55/+72
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull scheduler fixes from Ingo Molnar: "Misc fixes: a cgroup fix, a fair-scheduler migration accounting fix, a cputime fix and two cpuacct cleanups" * 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: sched/cpuacct: Simplify the cpuacct code sched/cpuacct: Rename parameter in cpuusage_write() for readability sched/fair: Add comments to explain select_idle_sibling() sched/fair: Fix fairness issue on migration sched/cgroup: Fix/cleanup cgroup teardown/init sched/cputime: Fix steal time accounting vs. CPU hotplug
2016-03-23PM / sleep: Clear pm_suspend_global_flags upon hibernateLukas Wunner1-0/+1
When suspending to RAM, waking up and later suspending to disk, we gratuitously runtime resume devices after the thaw phase. This does not occur if we always suspend to RAM or always to disk. pm_complete_with_resume_check(), which gets called from pci_pm_complete() among others, schedules a runtime resume if PM_SUSPEND_FLAG_FW_RESUME is set. The flag is set during a suspend-to-RAM cycle. It is cleared at the beginning of the suspend-to-RAM cycle but not afterwards and it is not cleared during a suspend-to-disk cycle at all. Fix it. Fixes: ef25ba047601 (PM / sleep: Add flags to indicate platform firmware involvement) Signed-off-by: Lukas Wunner <[email protected]> Cc: 4.4+ <[email protected]> # 4.4+ Signed-off-by: Rafael J. Wysocki <[email protected]>
2016-03-22kernel/...: convert pr_warning to pr_warnJoe Perches13-60/+52
Use the more common logging method with the eventual goal of removing pr_warning altogether. Miscellanea: - Realign arguments - Coalesce formats - Add missing space between a few coalesced formats Signed-off-by: Joe Perches <[email protected]> Acked-by: Rafael J. Wysocki <[email protected]> [kernel/power/suspend.c] Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2016-03-22memremap: add MEMREMAP_WC flagBrian Starkey1-2/+11
Add a flag to memremap() for writecombine mappings. Mappings satisfied by this flag will not be cached, however writes may be delayed or combined into more efficient bursts. This is most suitable for buffers written sequentially by the CPU for use by other DMA devices. Signed-off-by: Brian Starkey <[email protected]> Reviewed-by: Catalin Marinas <[email protected]> Cc: Dan Williams <[email protected]> Cc: Greg Kroah-Hartman <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2016-03-22memremap: don't modify flagsBrian Starkey1-7/+7
These patches implement a MEMREMAP_WC flag for memremap(), which can be used to obtain writecombine mappings. This is then used for setting up dma_coherent_mem regions which use the DMA_MEMORY_MAP flag. The motivation is to fix an alignment fault on arm64, and the suggestion to implement MEMREMAP_WC for this case was made at [1]. That particular issue is handled in patch 4, which makes sure that the appropriate memset function is used when zeroing allocations mapped as IO memory. This patch (of 4): Don't modify the flags input argument to memremap(). MEMREMAP_WB is already a special case so we can check for it directly instead of clearing flag bits in each mapper. Signed-off-by: Brian Starkey <[email protected]> Cc: Catalin Marinas <[email protected]> Cc: Dan Williams <[email protected]> Cc: Greg Kroah-Hartman <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2016-03-22kernel/signal.c: add compile-time check for __ARCH_SI_PREAMBLE_SIZEHelge Deller1-0/+4
The value of __ARCH_SI_PREAMBLE_SIZE defines the size (including padding) of the part of the struct siginfo that is before the union, and it is then used to calculate the needed padding (SI_PAD_SIZE) to make the size of struct siginfo equal to 128 (SI_MAX_SIZE) bytes. Depending on the target architecture and word width it equals to either 3 or 4 times sizeof int. Since the very beginning we had __ARCH_SI_PREAMBLE_SIZE wrong on the parisc architecture for the 64bit kernel build. It's even more frustrating, because it can easily be checked at compile time if the value was defined correctly. This patch adds such a check for the correctness of __ARCH_SI_PREAMBLE_SIZE in the hope that it will prevent existing and future architectures from running into the same problem. I refrained from replacing __ARCH_SI_PREAMBLE_SIZE by offsetof() in copy_siginfo() in include/asm-generic/siginfo.h, because a) it doesn't make any difference and b) it's used in the Documentation/kmemcheck.txt example. I ran this patch through the 0-DAY kernel test infrastructure and only the parisc architecture triggered as expected. That means that this patch should be OK for all major architectures. Signed-off-by: Helge Deller <[email protected]> Cc: Stephen Rothwell <[email protected]> Cc: Michael Ellerman <[email protected]> Cc: Oleg Nesterov <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2016-03-22kernel: add kcov code coverageDmitry Vyukov7-0/+301
kcov provides code coverage collection for coverage-guided fuzzing (randomized testing). Coverage-guided fuzzing is a testing technique that uses coverage feedback to determine new interesting inputs to a system. A notable user-space example is AFL (http://lcamtuf.coredump.cx/afl/). However, this technique is not widely used for kernel testing due to missing compiler and kernel support. kcov does not aim to collect as much coverage as possible. It aims to collect more or less stable coverage that is function of syscall inputs. To achieve this goal it does not collect coverage in soft/hard interrupts and instrumentation of some inherently non-deterministic or non-interesting parts of kernel is disbled (e.g. scheduler, locking). Currently there is a single coverage collection mode (tracing), but the API anticipates additional collection modes. Initially I also implemented a second mode which exposes coverage in a fixed-size hash table of counters (what Quentin used in his original patch). I've dropped the second mode for simplicity. This patch adds the necessary support on kernel side. The complimentary compiler support was added in gcc revision 231296. We've used this support to build syzkaller system call fuzzer, which has found 90 kernel bugs in just 2 months: https://github.com/google/syzkaller/wiki/Found-Bugs We've also found 30+ bugs in our internal systems with syzkaller. Another (yet unexplored) direction where kcov coverage would greatly help is more traditional "blob mutation". For example, mounting a random blob as a filesystem, or receiving a random blob over wire. Why not gcov. Typical fuzzing loop looks as follows: (1) reset coverage, (2) execute a bit of code, (3) collect coverage, repeat. A typical coverage can be just a dozen of basic blocks (e.g. an invalid input). In such context gcov becomes prohibitively expensive as reset/collect coverage steps depend on total number of basic blocks/edges in program (in case of kernel it is about 2M). Cost of kcov depends only on number of executed basic blocks/edges. On top of that, kernel requires per-thread coverage because there are always background threads and unrelated processes that also produce coverage. With inlined gcov instrumentation per-thread coverage is not possible. kcov exposes kernel PCs and control flow to user-space which is insecure. But debugfs should not be mapped as user accessible. Based on a patch by Quentin Casasnovas. [[email protected]: make task_struct.kcov_mode have type `enum kcov_mode'] [[email protected]: unbreak allmodconfig] [[email protected]: follow x86 Makefile layout standards] Signed-off-by: Dmitry Vyukov <[email protected]> Reviewed-by: Kees Cook <[email protected]> Cc: syzkaller <[email protected]> Cc: Vegard Nossum <[email protected]> Cc: Catalin Marinas <[email protected]> Cc: Tavis Ormandy <[email protected]> Cc: Will Deacon <[email protected]> Cc: Quentin Casasnovas <[email protected]> Cc: Kostya Serebryany <[email protected]> Cc: Eric Dumazet <[email protected]> Cc: Alexander Potapenko <[email protected]> Cc: Kees Cook <[email protected]> Cc: Bjorn Helgaas <[email protected]> Cc: Sasha Levin <[email protected]> Cc: David Drysdale <[email protected]> Cc: Ard Biesheuvel <[email protected]> Cc: Andrey Ryabinin <[email protected]> Cc: Kirill A. Shutemov <[email protected]> Cc: Jiri Slaby <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: "H. Peter Anvin" <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2016-03-22profile: hide unused functions when !CONFIG_PROC_FSArnd Bergmann1-2/+2
A couple of functions and variables in the profile implementation are used only on SMP systems by the procfs code, but are unused if either procfs is disabled or in uniprocessor kernels. gcc prints a harmless warning about the unused symbols: kernel/profile.c:243:13: error: 'profile_flip_buffers' defined but not used [-Werror=unused-function] static void profile_flip_buffers(void) ^ kernel/profile.c:266:13: error: 'profile_discard_flip_buffers' defined but not used [-Werror=unused-function] static void profile_discard_flip_buffers(void) ^ kernel/profile.c:330:12: error: 'profile_cpu_callback' defined but not used [-Werror=unused-function] static int profile_cpu_callback(struct notifier_block *info, ^ This adds further #ifdef to the file, to annotate exactly in which cases they are used. I have done several thousand ARM randconfig kernels with this patch applied and no longer get any warnings in this file. Signed-off-by: Arnd Bergmann <[email protected]> Cc: Vlastimil Babka <[email protected]> Cc: Robin Holt <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Christoph Lameter <[email protected]> Cc: Tejun Heo <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2016-03-22panic: change nmi_panic from macro to functionHidehiro Kawai1-0/+20
Commit 1717f2096b54 ("panic, x86: Fix re-entrance problem due to panic on NMI") and commit 58c5661f2144 ("panic, x86: Allow CPUs to save registers even if looping in NMI context") introduced nmi_panic() which prevents concurrent/recursive execution of panic(). It also saves registers for the crash dump on x86. However, there are some cases where NMI handlers still use panic(). This patch set partially replaces them with nmi_panic() in those cases. Even this patchset is applied, some NMI or similar handlers (e.g. MCE handler) continue to use panic(). This is because I can't test them well and actual problems won't happen. For example, the possibility that normal panic and panic on MCE happen simultaneously is very low. This patch (of 3): Convert nmi_panic() to a proper function and export it instead of exporting internal implementation details to modules, for obvious reasons. Signed-off-by: Hidehiro Kawai <[email protected]> Acked-by: Borislav Petkov <[email protected]> Acked-by: Michal Nazarewicz <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Rasmus Villemoes <[email protected]> Cc: Nicolas Iooss <[email protected]> Cc: Javi Merino <[email protected]> Cc: Gobinda Charan Maji <[email protected]> Cc: "Steven Rostedt (Red Hat)" <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Vitaly Kuznetsov <[email protected]> Cc: HATAYAMA Daisuke <[email protected]> Cc: Tejun Heo <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2016-03-22fs/coredump: prevent fsuid=0 dumps into user-controlled directoriesJann Horn1-1/+1
This commit fixes the following security hole affecting systems where all of the following conditions are fulfilled: - The fs.suid_dumpable sysctl is set to 2. - The kernel.core_pattern sysctl's value starts with "/". (Systems where kernel.core_pattern starts with "|/" are not affected.) - Unprivileged user namespace creation is permitted. (This is true on Linux >=3.8, but some distributions disallow it by default using a distro patch.) Under these conditions, if a program executes under secure exec rules, causing it to run with the SUID_DUMP_ROOT flag, then unshares its user namespace, changes its root directory and crashes, the coredump will be written using fsuid=0 and a path derived from kernel.core_pattern - but this path is interpreted relative to the root directory of the process, allowing the attacker to control where a coredump will be written with root privileges. To fix the security issue, always interpret core_pattern for dumps that are written under SUID_DUMP_ROOT relative to the root directory of init. Signed-off-by: Jann Horn <[email protected]> Acked-by: Kees Cook <[email protected]> Cc: Al Viro <[email protected]> Cc: "Eric W. Biederman" <[email protected]> Cc: Andy Lutomirski <[email protected]> Cc: Oleg Nesterov <[email protected]> Cc: <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2016-03-22ptrace: change __ptrace_unlink() to clear ->ptrace under ->siglockOleg Nesterov1-2/+1
This test-case (simplified version of generated by syzkaller) #include <unistd.h> #include <sys/ptrace.h> #include <sys/wait.h> void test(void) { for (;;) { if (fork()) { wait(NULL); continue; } ptrace(PTRACE_SEIZE, getppid(), 0, 0); ptrace(PTRACE_INTERRUPT, getppid(), 0, 0); _exit(0); } } int main(void) { int np; for (np = 0; np < 8; ++np) if (!fork()) test(); while (wait(NULL) > 0) ; return 0; } triggers the 2nd WARN_ON_ONCE(!signr) warning in do_jobctl_trap(). The problem is that __ptrace_unlink() clears task->jobctl under siglock but task->ptrace is cleared without this lock held; this fools the "else" branch which assumes that !PT_SEIZED means PT_PTRACED. Note also that most of other PTRACE_SEIZE checks can race with detach from the exiting tracer too. Say, the callers of ptrace_trap_notify() assume that SEIZED can't go away after it was checked. Signed-off-by: Oleg Nesterov <[email protected]> Reported-by: Dmitry Vyukov <[email protected]> Cc: Tejun Heo <[email protected]> Cc: syzkaller <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2016-03-22auditsc: for seccomp events, log syscall compat state using in_compat_syscallAndy Lutomirski1-2/+2
Except on SPARC, this is what the code always did. SPARC compat seccomp was buggy, although the impact of the bug was limited because SPARC 32-bit and 64-bit syscall numbers are the same. Signed-off-by: Andy Lutomirski <[email protected]> Cc: Paul Moore <[email protected]> Cc: Eric Paris <[email protected]> Cc: David Miller <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2016-03-22ptrace: in PEEK_SIGINFO, check syscall bitness, not task bitnessAndy Lutomirski1-1/+1
Users of the 32-bit ptrace() ABI expect the full 32-bit ABI. siginfo translation should check ptrace() ABI, not caller task ABI. This is an ABI change on SPARC. Let's hope that no one relied on the old buggy ABI. Signed-off-by: Andy Lutomirski <[email protected]> Cc: Oleg Nesterov <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2016-03-22seccomp: check in_compat_syscall, not is_compat_task, in strict modeAndy Lutomirski1-2/+2
Seccomp wants to know the syscall bitness, not the caller task bitness, when it selects the syscall whitelist. As far as I know, this makes no difference on any architecture, so it's not a security problem. (It generates identical code everywhere except sparc, and, on sparc, the syscall numbering is the same for both ABIs.) Signed-off-by: Andy Lutomirski <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2016-03-22kernel/hung_task.c: use timeout diff when timeout is updatedTetsuo Handa1-8/+13
When new timeout is written to /proc/sys/kernel/hung_task_timeout_secs, khungtaskd is interrupted and again sleeps for full timeout duration. This means that hang task will not be checked if new timeout is written periodically within old timeout duration and/or checking of hang task will be delayed for up to previous timeout duration. Fix this by remembering last time khungtaskd checked hang task. This change will allow other watchdog tasks (if any) to share khungtaskd by sleeping for minimal timeout diff of all watchdog tasks. Doing more watchdog tasks from khungtaskd will reduce the possibility of printk() collisions by multiple watchdog threads. Signed-off-by: Tetsuo Handa <[email protected]> Cc: Oleg Nesterov <[email protected]> Cc: Aaron Tomlin <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2016-03-22tracing: Record and show NMI statePeter Zijlstra3-3/+9
The latency tracer format has a nice column to indicate IRQ state, but this is not able to tell us about NMI state. When tracing perf interrupt handlers (which often run in NMI context) it is very useful to see how the events nest. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Signed-off-by: Steven Rostedt <[email protected]>
2016-03-22tracing: Fix trace_printk() to print when not using bprintk()Steven Rostedt (Red Hat)1-0/+3
The trace_printk() code will allocate extra buffers if the compile detects that a trace_printk() is used. To do this, the format of the trace_printk() is saved to the __trace_printk_fmt section, and if that section is bigger than zero, the buffers are allocated (along with a message that this has happened). If trace_printk() uses a format that is not a constant, and thus something not guaranteed to be around when the print happens, the compiler optimizes the fmt out, as it is not used, and the __trace_printk_fmt section is not filled. This means the kernel will not allocate the special buffers needed for the trace_printk() and the trace_printk() will not write anything to the tracing buffer. Adding a "__used" to the variable in the __trace_printk_fmt section will keep it around, even though it is set to NULL. This will keep the string from being printed in the debugfs/tracing/printk_formats section as it is not needed. Reported-by: Vlastimil Babka <[email protected]> Fixes: 07d777fe8c398 "tracing: Add percpu buffers for trace_printk()" Cc: [email protected] # v3.5+ Signed-off-by: Steven Rostedt <[email protected]>