Age | Commit message (Collapse) | Author | Files | Lines |
|
Sometimes it's useful to organize member fields in cache-line boundary.
The 'typecln' sort key is short for type-cacheline and to show samples
in each cacheline. The cacheline size is fixed to 64 for now, but it
can read the actual size once it saves the value from sysfs.
For example, you maybe want to which cacheline in a target is hot or
cold. The following shows members in the cfs_rq's first cache line.
$ perf report -s type,typecln,typeoff -H
...
- 2.67% struct cfs_rq
+ 1.23% struct cfs_rq: cache-line 2
+ 0.57% struct cfs_rq: cache-line 4
+ 0.46% struct cfs_rq: cache-line 6
- 0.41% struct cfs_rq: cache-line 0
0.39% struct cfs_rq +0x14 (h_nr_running)
0.02% struct cfs_rq +0x38 (tasks_timeline.rb_leftmost)
...
Committer testing:
# root@number:~# perf report -s type,typecln,typeoff -H --stdio
# Total Lost Samples: 0
#
# Samples: 5K of event 'cpu_atom/mem-loads,ldlat=5/P'
# Event count (approx.): 312251
#
# Overhead Data Type / Data Type Cacheline / Data Type Offset
# .............. ..................................................
#
<SNIP>
0.07% struct sigaction
0.05% struct sigaction: cache-line 1
0.02% struct sigaction +0x58 (sa_mask)
0.02% struct sigaction +0x78 (sa_mask)
0.03% struct sigaction: cache-line 0
0.02% struct sigaction +0x38 (sa_mask)
0.01% struct sigaction +0x8 (sa_mask)
<SNIP>
Signed-off-by: Namhyung Kim <[email protected]>
Tested-by: Arnaldo Carvalho de Melo <[email protected]>
Cc: Adrian Hunter <[email protected]>
Cc: Athira Rajeev <[email protected]>
Cc: Ian Rogers <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Jiri Olsa <[email protected]>
Cc: Kan Liang <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Arnaldo Carvalho de Melo <[email protected]>
|
|
It'd be better to have them in hex to check cacheline alignment.
Percent offset size field
100.00 0 0x1c0 struct cfs_rq {
0.00 0 0x10 struct load_weight load {
0.00 0 0x8 long unsigned int weight;
0.00 0x8 0x4 u32 inv_weight;
};
0.00 0x10 0x4 unsigned int nr_running;
14.56 0x14 0x4 unsigned int h_nr_running;
0.00 0x18 0x4 unsigned int idle_nr_running;
0.00 0x1c 0x4 unsigned int idle_h_nr_running;
...
Committer notes:
Justification from Namhyung when asked about why it would be "better":
Cache line sizes are power of 2 so it'd be natural to use hex and
check whether an offset is in the same boundary. Also 'perf annotate'
shows instruction offsets in hex.
>
> Maybe this should be selectable?
I can add an option and/or a config if you want.
Signed-off-by: Namhyung Kim <[email protected]>
Tested-by: Arnaldo Carvalho de Melo <[email protected]>
Cc: Adrian Hunter <[email protected]>
Cc: Athira Rajeev <[email protected]>
Cc: Ian Rogers <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Jiri Olsa <[email protected]>
Cc: Kan Liang <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Arnaldo Carvalho de Melo <[email protected]>
|
|
Some sort keys are meaningful only in a specific mode - like branch
stack and memory (data-src). Add the mode to skip unnecessary ones.
This will be used for 'perf mem report' later.
While at it, change the prefix for the -F/--fields option to remove
the duplicate part.
Before:
$ perf report -F
Error: switch `F' requires a value
Usage: perf report [<options>]
-F, --fields <key[,keys...]>
output field(s): overhead period sample overhead overhead_sys
overhead_us overhead_guest_sys overhead_guest_us overhead_children
sample period weight1 weight2 weight3 ins_lat retire_lat
...
After:
$ perf report -F
Error: switch `F' requires a value
Usage: perf report [<options>]
-F, --fields <key[,keys...]>
output field(s): overhead overhead_sys overhead_us
overhead_guest_sys overhead_guest_us overhead_children
sample period weight1 weight2 weight3 ins_lat retire_lat
...
Signed-off-by: Namhyung Kim <[email protected]>
Cc: Adrian Hunter <[email protected]>
Cc: Athira Rajeev <[email protected]>
Cc: Ian Rogers <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Jiri Olsa <[email protected]>
Cc: Kan Liang <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Stephane Eranian <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Arnaldo Carvalho de Melo <[email protected]>
|
|
It's expected that both hist entries are in the same hists when
comparing two. But the current code in the function checks one without
dso sort key and other with the key. This would make the condition true
in any case.
I guess the intention of the original commit was to add '!' for the
right side too. But as it should be the same, let's just remove it.
Fixes: 69849fc5d2119 ("perf hists: Move sort__has_dso into struct perf_hpp_list")
Reviewed-by: Kan Liang <[email protected]>
Signed-off-by: Namhyung Kim <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
|
|
Add reference count checking and switch 'struct mem_info' usage to use
accessor functions.
Signed-off-by: Ian Rogers <[email protected]>
Cc: Adrian Hunter <[email protected]>
Cc: Alexander Shishkin <[email protected]>
Cc: Andi Kleen <[email protected]>
Cc: Athira Rajeev <[email protected]>
Cc: Ben Gainey <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: James Clark <[email protected]>
Cc: Jiri Olsa <[email protected]>
Cc: K Prateek Nayak <[email protected]>
Cc: Kajol Jain <[email protected]>
Cc: Kan Liang <[email protected]>
Cc: Li Dong <[email protected]>
Cc: Mark Rutland <[email protected]>
Cc: Namhyung Kim <[email protected]>
Cc: Oliver Upton <[email protected]>
Cc: Paran Lee <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Ravi Bangoria <[email protected]>
Cc: Sun Haiyong <[email protected]>
Cc: Tim Chen <[email protected]>
Cc: Yanteng Si <[email protected]>
Cc: Yicong Yang <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Arnaldo Carvalho de Melo <[email protected]>
|
|
Move mem-info to its own header rather than having it split between
mem-events and symbol.
Signed-off-by: Ian Rogers <[email protected]>
Cc: Adrian Hunter <[email protected]>
Cc: Alexander Shishkin <[email protected]>
Cc: Andi Kleen <[email protected]>
Cc: Athira Rajeev <[email protected]>
Cc: Ben Gainey <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: James Clark <[email protected]>
Cc: Jiri Olsa <[email protected]>
Cc: K Prateek Nayak <[email protected]>
Cc: Kajol Jain <[email protected]>
Cc: Kan Liang <[email protected]>
Cc: Li Dong <[email protected]>
Cc: Mark Rutland <[email protected]>
Cc: Namhyung Kim <[email protected]>
Cc: Oliver Upton <[email protected]>
Cc: Paran Lee <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Ravi Bangoria <[email protected]>
Cc: Sun Haiyong <[email protected]>
Cc: Tim Chen <[email protected]>
Cc: Yanteng Si <[email protected]>
Cc: Yicong Yang <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Arnaldo Carvalho de Melo <[email protected]>
|
|
Add reference count checking to struct dso, this can help with
implementing correct reference counting discipline. To avoid
RC_CHK_ACCESS everywhere, add accessor functions for the variables in
struct dso.
The majority of the change is mechanical in nature and not easy to
split up.
Committer testing:
'perf test' up to this patch shows no regressions.
But:
util/symbol.c: In function ‘dso__load_bfd_symbols’:
util/symbol.c:1683:9: error: too few arguments to function ‘dso__set_adjust_symbols’
1683 | dso__set_adjust_symbols(dso);
| ^~~~~~~~~~~~~~~~~~~~~~~
In file included from util/symbol.c:21:
util/dso.h:268:20: note: declared here
268 | static inline void dso__set_adjust_symbols(struct dso *dso, bool val)
| ^~~~~~~~~~~~~~~~~~~~~~~
make[6]: *** [/home/acme/git/perf-tools-next/tools/build/Makefile.build:106: /tmp/tmp.ZWHbQftdN6/util/symbol.o] Error 1
MKDIR /tmp/tmp.ZWHbQftdN6/tests/workloads/
make[6]: *** Waiting for unfinished jobs....
This was updated:
- symbols__fixup_end(&dso->symbols, false);
- symbols__fixup_duplicate(&dso->symbols);
- dso->adjust_symbols = 1;
+ symbols__fixup_end(dso__symbols(dso), false);
+ symbols__fixup_duplicate(dso__symbols(dso));
+ dso__set_adjust_symbols(dso);
But not build tested with BUILD_NONDISTRO and libbfd devel files installed
(binutils-devel on fedora).
Add the missing argument:
symbols__fixup_end(dso__symbols(dso), false);
symbols__fixup_duplicate(dso__symbols(dso));
- dso__set_adjust_symbols(dso);
+ dso__set_adjust_symbols(dso, true);
Signed-off-by: Ian Rogers <[email protected]>
Tested-by: Arnaldo Carvalho de Melo <[email protected]>
Cc: Adrian Hunter <[email protected]>
Cc: Ahelenia Ziemiańska <[email protected]>
Cc: Alexander Shishkin <[email protected]>
Cc: Andi Kleen <[email protected]>
Cc: Athira Rajeev <[email protected]>
Cc: Ben Gainey <[email protected]>
Cc: Changbin Du <[email protected]>
Cc: Chengen Du <[email protected]>
Cc: Colin Ian King <[email protected]>
Cc: Dima Kogan <[email protected]>
Cc: Ilkka Koskinen <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: James Clark <[email protected]>
Cc: Jiri Olsa <[email protected]>
Cc: K Prateek Nayak <[email protected]>
Cc: Kan Liang <[email protected]>
Cc: Leo Yan <[email protected]>
Cc: Li Dong <[email protected]>
Cc: Mark Rutland <[email protected]>
Cc: Masami Hiramatsu <[email protected]>
Cc: Namhyung Kim <[email protected]>
Cc: Paran Lee <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Song Liu <[email protected]>
Cc: Sun Haiyong <[email protected]>
Cc: Thomas Richter <[email protected]>
Cc: Tiezhu Yang <[email protected]>
Cc: Yanteng Si <[email protected]>
Cc: zhaimingbing <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Arnaldo Carvalho de Melo <[email protected]>
|
|
Add weight1, weight2 and weight3 fields to -F/--fields and their aliases
like 'ins_lat', 'p_stage_cyc' and 'retire_lat'. Note that they are in
the sort keys too but the difference is that output fields will sum up
the weight values and display the average.
In the sort key, users can see the distribution of weight value and I
think it's confusing we have local vs. global weight for the same weight.
For example, I experiment with mem-loads events to get the weights. On
my laptop, it seems only weight1 field is supported.
$ perf mem record -- perf test -w noploop
Let's look at the noploop function only. It has 7 samples.
$ perf script -F event,ip,sym,weight | grep noploop
# event weight ip sym
cpu/mem-loads,ldlat=30/P: 43 55b3c122bffc noploop
cpu/mem-loads,ldlat=30/P: 48 55b3c122bffc noploop
cpu/mem-loads,ldlat=30/P: 38 55b3c122bffc noploop <--- same weight
cpu/mem-loads,ldlat=30/P: 38 55b3c122bffc noploop <--- same weight
cpu/mem-loads,ldlat=30/P: 59 55b3c122bffc noploop
cpu/mem-loads,ldlat=30/P: 33 55b3c122bffc noploop
cpu/mem-loads,ldlat=30/P: 38 55b3c122bffc noploop <--- same weight
When you use the 'weight' sort key, it'd show entries with a separate
weight value separately. Also note that the first entry has 3 samples
with weight value 38, so they are displayed together and the weight
value is the sum of 3 samples (114 = 38 * 3).
$ perf report -n -s +weight | grep -e Weight -e noploop
# Overhead Samples Command Shared Object Symbol Weight
0.53% 3 perf perf [.] noploop 114
0.18% 1 perf perf [.] noploop 59
0.18% 1 perf perf [.] noploop 48
0.18% 1 perf perf [.] noploop 43
0.18% 1 perf perf [.] noploop 33
If you use 'local_weight' sort key, you can see the actual weight.
$ perf report -n -s +local_weight | grep -e Weight -e noploop
# Overhead Samples Command Shared Object Symbol Local Weight
0.53% 3 perf perf [.] noploop 38
0.18% 1 perf perf [.] noploop 59
0.18% 1 perf perf [.] noploop 48
0.18% 1 perf perf [.] noploop 43
0.18% 1 perf perf [.] noploop 33
But when you use the -F/--field option instead, you can see the average
weight for the while noploop function (as it won't group samples by
weight value and use the default 'comm,dso,sym' sort keys).
$ perf report -n -F +weight | grep -e Weight -e noploop
Warning:
--fields weight shows the average value unlike in the --sort key.
# Overhead Samples Weight1 Command Shared Object Symbol
1.23% 7 42.4 perf perf [.] noploop
The weight1 field shows the average value:
(38 * 3 + 59 + 48 + 43 + 33) / 7 = 42.4
Also it'd show the warning that 'weight' field has the average value.
Using 'weight1' can remove the warning.
Signed-off-by: Namhyung Kim <[email protected]>
Cc: Adrian Hunter <[email protected]>
Cc: Andi Kleen <[email protected]>
Cc: Athira Rajeev <[email protected]>
Cc: Ian Rogers <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Jiri Olsa <[email protected]>
Cc: Kan Liang <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Stephane Eranian <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Arnaldo Carvalho de Melo <[email protected]>
|
|
Prevent a perf report segfault with the (non sensical) --no-parent
option
Signed-off-By: Andi Kleen <[email protected]>
Acked-by: Namhyung <[email protected]>
Signed-off-by: Namhyung Kim <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
|
|
Support data type annotation with new --data-type option. It internally
uses type sort key to collect sample histogram for the type and display
every members like below.
$ perf annotate --data-type
...
Annotate type: 'struct cfs_rq' in [kernel.kallsyms] (13 samples):
============================================================================
samples offset size field
13 0 640 struct cfs_rq {
2 0 16 struct load_weight load {
2 0 8 unsigned long weight;
0 8 4 u32 inv_weight;
};
0 16 8 unsigned long runnable_weight;
0 24 4 unsigned int nr_running;
1 28 4 unsigned int h_nr_running;
...
For simplicity it prints the number of samples per field for now.
But it should be easy to show the overhead percentage instead.
The number at the outer struct is a sum of the numbers of the inner
members. For example, struct cfs_rq got total 13 samples, and 2 came
from the load (struct load_weight) and 1 from h_nr_running. Similarly,
the struct load_weight got total 2 samples and they all came from the
weight field.
I've added two new flags in the symbol_conf for this. The
annotate_data_member is to get the members of the type. This is also
needed for perf report with typeoff sort key. The annotate_data_sample
is to update sample stats for each offset and used only in annotate.
Currently it only support stdio output mode, TUI support can be added
later.
Committer testing:
With the perf.data from the previous csets, a very simple, short
duration one:
# perf annotate --data-type
Annotate type: 'struct list_head' in [kernel.kallsyms] (1 samples):
============================================================================
samples offset size field
1 0 16 struct list_head {
0 0 8 struct list_head* next;
1 8 8 struct list_head* prev;
};
Annotate type: 'char' in [kernel.kallsyms] (1 samples):
============================================================================
samples offset size field
1 0 1 char ;
#
Signed-off-by: Namhyung Kim <[email protected]>
Tested-by: Arnaldo Carvalho de Melo <[email protected]>
Cc: Adrian Hunter <[email protected]>
Cc: Ian Rogers <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Jiri Olsa <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Masami Hiramatsu <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Stephane Eranian <[email protected]>
Cc: [email protected]
Cc: [email protected]
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Arnaldo Carvalho de Melo <[email protected]>
|
|
The symoff sort key is to print symbol and offset of sample. This is
useful for data type profiling to show exact instruction in the function
which refers the data.
$ perf report -s type,sym,typeoff,symoff --hierarchy
...
# Overhead Data Type / Symbol / Data Type Offset / Symbol Offset
# .............. .....................................................
#
1.23% struct cfs_rq
0.84% update_blocked_averages
0.19% struct cfs_rq +336 (leaf_cfs_rq_list.next)
0.19% [k] update_blocked_averages+0x96
0.19% struct cfs_rq +0 (load.weight)
0.14% [k] update_blocked_averages+0x104
0.04% [k] update_blocked_averages+0x31c
0.17% struct cfs_rq +404 (throttle_count)
0.12% [k] update_blocked_averages+0x9d
0.05% [k] update_blocked_averages+0x1f9
0.08% struct cfs_rq +272 (propagate)
0.07% [k] update_blocked_averages+0x3d3
0.02% [k] update_blocked_averages+0x45b
...
Committer testing:
# perf report --stdio -s type,typeoff,symoff
# To display the perf.data header info, please use --header/--header-only options.
#
#
# Total Lost Samples: 0
#
# Samples: 4 of event 'cpu_atom/mem-loads,ldlat=30/P'
# Event count (approx.): 7
#
# Overhead Data Type Data Type Offset Symbol Offset
# ........ ......... ................ .............
#
42.86% struct list_head struct list_head +8 (prev) [k] __list_del_entry_valid_or_report+0x7
28.57% (unknown) (unknown) +0 (no field) [.] _nl_intern_locale_data+0x25
14.29% char char +0 (no field) [k] strncpy_from_user+0xa5
14.29% (unknown) (unknown) +0 (no field) [.] _dl_lookup_symbol_x+0x50
#
# (Tip: To change sampling frequency to 100 Hz: perf record -F 100)
#
Signed-off-by: Namhyung Kim <[email protected]>
Tested-by: Arnaldo Carvalho de Melo <[email protected]>
Cc: Adrian Hunter <[email protected]>
Cc: Ian Rogers <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Jiri Olsa <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Masami Hiramatsu <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Stephane Eranian <[email protected]>
Cc: [email protected]
Cc: [email protected]
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Arnaldo Carvalho de Melo <[email protected]>
|
|
The typeoff sort key shows the data type name, offset and the name of
the field. This is useful to see which field in the struct is accessed
most frequently.
$ perf report -s type,typeoff --hierarchy --stdio
...
# Overhead Data Type / Data Type Offset
# ............ ............................
#
...
1.23% struct cfs_rq
0.19% struct cfs_rq +404 (throttle_count)
0.19% struct cfs_rq +0 (load.weight)
0.19% struct cfs_rq +336 (leaf_cfs_rq_list.next)
0.09% struct cfs_rq +272 (propagate)
0.09% struct cfs_rq +196 (removed.nr)
0.09% struct cfs_rq +80 (curr)
0.09% struct cfs_rq +544 (lt_b_children_throttled)
0.06% struct cfs_rq +320 (rq)
Committer testing:
Again with the perf.data from the previous csets:
# perf report --stdio -s type,typeoff
# To display the perf.data header info, please use --header/--header-only options.
#
#
# Total Lost Samples: 0
#
# Samples: 4 of event 'cpu_atom/mem-loads,ldlat=30/P'
# Event count (approx.): 7
#
# Overhead Data Type Data Type Offset
# ........ ......... ................
#
42.86% struct list_head struct list_head +8 (prev)
42.86% (unknown) (unknown) +0 (no field)
14.29% char char +0 (no field)
#
# (Tip: To see callchains in a more compact form: perf report -g folded)
#
# perf report --stdio -s dso,type,typeoff
# To display the perf.data header info, please use --header/--header-only options.
#
#
# Total Lost Samples: 0
#
# Samples: 4 of event 'cpu_atom/mem-loads,ldlat=30/P'
# Event count (approx.): 7
#
# Overhead Shared Object Data Type Data Type Offset
# ........ .................... ......... ................
#
42.86% [kernel.kallsyms] struct list_head struct list_head +8 (prev)
28.57% libc.so.6 (unknown) (unknown) +0 (no field)
14.29% [kernel.kallsyms] char char +0 (no field)
14.29% ld-linux-x86-64.so.2 (unknown) (unknown) +0 (no field)
#
# (Tip: If you have debuginfo enabled, try: perf report -s sym,srcline)
#
#
Signed-off-by: Namhyung Kim <[email protected]>
Tested-by: Arnaldo Carvalho de Melo <[email protected]>
Cc: Adrian Hunter <[email protected]>
Cc: Ian Rogers <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Jiri Olsa <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Masami Hiramatsu <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Stephane Eranian <[email protected]>
Cc: [email protected]
Cc: [email protected]
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Arnaldo Carvalho de Melo <[email protected]>
|
|
Add child member field if the current type is a composite type like a
struct or union. The member fields are linked in the children list and
do the same recursively if the child itself is a composite type.
Add 'self' member to the annotated_data_type to handle the members in
the same way.
Signed-off-by: Namhyung Kim <[email protected]>
Cc: Adrian Hunter <[email protected]>
Cc: Ian Rogers <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Jiri Olsa <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Masami Hiramatsu <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Stephane Eranian <[email protected]>
Cc: [email protected]
Cc: [email protected]
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Arnaldo Carvalho de Melo <[email protected]>
|
|
The 'type' sort key is to aggregate hist entries by data type they
access. Add mem_type field to hist_entry struct to save the type. If
hist_entry__get_data_type() returns NULL, it'd use the 'unknown_type'
instance.
Committer testing:
Before:
# perf mem record sleep 2s
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.037 MB perf.data (4 samples) ]
root@number:/home/acme/Downloads# perf report --stdio -s type
Error:
Unknown --sort key: `type'
Usage: perf report [<options>]
-s, --sort <key[,key2...]>
sort by key(s): overhead overhead_sys overhead_us overhead_guest_sys
overhead_guest_us overhead_children sample period
pid comm dso symbol parent cpu socket srcline srcfile
local_weight weight transaction trace symbol_size
dso_size cgroup cgroup_id ipc_null time code_page_size
local_ins_lat ins_lat local_p_stage_cyc p_stage_cyc
addr local_retire_lat retire_lat simd dso_from dso_to
symbol_from symbol_to mispredict abort in_tx cycles
srcline_from srcline_to ipc_lbr addr_from addr_to
symbol_daddr dso_daddr locked tlb mem snoop dcacheline
symbol_iaddr phys_daddr data_page_size blocked
#
After:
# perf report --stdio -s type
# To display the perf.data header info, please use --header/--header-only options.
#
#
# Total Lost Samples: 0
#
# Samples: 4 of event 'cpu_atom/mem-loads,ldlat=30/P'
# Event count (approx.): 7
#
# Overhead Data Type
# ........ .........
#
100.00% (unknown)
#
# (Tip: Print event counts in CSV format with: perf stat -x,)
#
# rpm -q kernel-debuginfo
kernel-debuginfo-6.6.4-200.fc39.x86_64
# uname -r
6.6.4-200.fc39.x86_64
#
Signed-off-by: Namhyung Kim <[email protected]>
Tested-by: Arnaldo Carvalho de Melo <[email protected]>
Cc: Adrian Hunter <[email protected]>
Cc: Ian Rogers <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Jiri Olsa <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Masami Hiramatsu <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Stephane Eranian <[email protected]>
Cc: [email protected]>
Cc: [email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Arnaldo Carvalho de Melo <[email protected]>
|
|
The cycles info is only meaningful when sample has branch stacks. To
save the memory for normal cases, move those fields to a new 'struct
annotated_branch' and dynamically allocate it when needed. Also move
cycles_hist from annotated_source as it's related here.
Signed-off-by: Namhyung Kim <[email protected]>
Cc: Adrian Hunter <[email protected]>
Cc: Christophe JAILLET <[email protected]>
Cc: Ian Rogers <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Jiri Olsa <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Arnaldo Carvalho de Melo <[email protected]>
|
|
Comparing pointers with reference count checking is tricky to avoid a
SEGV. Add a convenience macro to simplify and use.
Signed-off-by: Ian Rogers <[email protected]>
Cc: K Prateek Nayak <[email protected]>
Cc: Ravi Bangoria <[email protected]>
Cc: Sandipan Das <[email protected]>
Cc: Anshuman Khandual <[email protected]>
Cc: German Gomez <[email protected]>
Cc: James Clark <[email protected]>
Cc: Nick Terrell <[email protected]>
Cc: Sean Christopherson <[email protected]>
Cc: Changbin Du <[email protected]>
Cc: liuwenyu <[email protected]>
Cc: Yang Jihong <[email protected]>
Cc: Masami Hiramatsu <[email protected]>
Cc: Miguel Ojeda <[email protected]>
Cc: Song Liu <[email protected]>
Cc: Leo Yan <[email protected]>
Cc: Kajol Jain <[email protected]>
Cc: Andi Kleen <[email protected]>
Cc: Kan Liang <[email protected]>
Cc: Athira Rajeev <[email protected]>
Cc: Yanteng Si <[email protected]>
Cc: Liam Howlett <[email protected]>
Cc: Paolo Bonzini <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Namhyung Kim <[email protected]>
|
|
This is a string constant that gets returned and then strcmp() around,
we can instead just do a pointer comparision.
That requires a new global variable to comply with these warnings from
some versions of clang and gcc:
41 68.95 fedora:rawhide : FAIL clang version 16.0.4 (Fedora 16.0.4-1.fc39)
result of comparison against a string literal is unspecified (use an explicit string comparison function instead) [-Werror,-Wstring-compare]
if (start_line != SRCLINE_UNKNOWN &&
^ ~~~~~~~~~~~~~~~ 41
Ack comments:
Agreed, the strcmps make me nervous as they won't distinguish heap from
a global meaning we could end up with things like pointers to freed
memory. The comparison with the global is always going to be same imo.
Acked-by: Ian Rogers <[email protected]>
Cc: Adrian Hunter <[email protected]>
Cc: Alexander Shishkin <[email protected]>
Cc: Ali Saidi <[email protected]>
Cc: Andi Kleen <[email protected]>
Cc: Athira Rajeev <[email protected]>
Cc: Brian Robbins <[email protected]>
Cc: Changbin Du <[email protected]>
Cc: Dmitrii Dolgov <[email protected]>
Cc: Fangrui Song <[email protected]>
Cc: German Gomez <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Ivan Babrou <[email protected]>
Cc: James Clark <[email protected]>
Cc: Jing Zhang <[email protected]>
Cc: Jiri Olsa <[email protected]>
Cc: John Garry <[email protected]>
Cc: K Prateek Nayak <[email protected]>
Cc: Kan Liang <[email protected]>
Cc: Leo Yan <[email protected]>
Cc: Liam Howlett <[email protected]>
Cc: Mark Rutland <[email protected]>
Cc: Miguel Ojeda <[email protected]>
Cc: Mike Leach <[email protected]>
Cc: Namhyung Kim <[email protected]>
Cc: Naveen N. Rao <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Ravi Bangoria <[email protected]>
Cc: Sean Christopherson <[email protected]>
Cc: Steinar H. Gunderson <[email protected]>
Cc: Suzuki Poulouse <[email protected]>
Cc: Wenyu Liu <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: Yang Jihong <[email protected]>
Cc: Ye Xingchen <[email protected]>
Cc: Yuan Can <[email protected]>
Link: https://lore.kernel.org/lkml/[email protected]
Signed-off-by: Arnaldo Carvalho de Melo <[email protected]>
|
|
Modify struct declaration and accessor functions for the reference
count checkers additional layer of indirection. Make sure pid_cmp in
builtin-sched.c uses the underlying/original struct in pointer
arithmetic, and not the temporary get/put indirection.
Signed-off-by: Ian Rogers <[email protected]>
Cc: Adrian Hunter <[email protected]>
Cc: Alexander Shishkin <[email protected]>
Cc: Ali Saidi <[email protected]>
Cc: Andi Kleen <[email protected]>
Cc: Athira Rajeev <[email protected]>
Cc: Brian Robbins <[email protected]>
Cc: Changbin Du <[email protected]>
Cc: Dmitrii Dolgov <[email protected]>
Cc: Fangrui Song <[email protected]>
Cc: German Gomez <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Ivan Babrou <[email protected]>
Cc: James Clark <[email protected]>
Cc: Jing Zhang <[email protected]>
Cc: Jiri Olsa <[email protected]>
Cc: John Garry <[email protected]>
Cc: K Prateek Nayak <[email protected]>
Cc: Kan Liang <[email protected]>
Cc: Leo Yan <[email protected]>
Cc: Liam Howlett <[email protected]>
Cc: Mark Rutland <[email protected]>
Cc: Miguel Ojeda <[email protected]>
Cc: Mike Leach <[email protected]>
Cc: Namhyung Kim <[email protected]>
Cc: Naveen N. Rao <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Ravi Bangoria <[email protected]>
Cc: Sean Christopherson <[email protected]>
Cc: Steinar H. Gunderson <[email protected]>
Cc: Suzuki Poulouse <[email protected]>
Cc: Wenyu Liu <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: Yang Jihong <[email protected]>
Cc: Ye Xingchen <[email protected]>
Cc: Yuan Can <[email protected]>
Cc: [email protected]
Cc: [email protected]
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Arnaldo Carvalho de Melo <[email protected]>
|
|
Using accessors will make it easier to add reference count checking in
later patches.
Committer notes:
thread->nsinfo wasn't wrapped as it is used together with
nsinfo__zput(), where does a trick to set the field with a refcount
being dropped to NULL, and that doesn't work well with using
thread__nsinfo(thread), that loses the &thread->nsinfo pointer.
When refcount checking is added to 'struct thread', later in this
series, nsinfo__zput(RC_CHK_ACCESS(thread)->nsinfo) will be used to
check the thread pointer.
Signed-off-by: Ian Rogers <[email protected]>
Cc: Adrian Hunter <[email protected]>
Cc: Alexander Shishkin <[email protected]>
Cc: Ali Saidi <[email protected]>
Cc: Andi Kleen <[email protected]>
Cc: Athira Rajeev <[email protected]>
Cc: Brian Robbins <[email protected]>
Cc: Changbin Du <[email protected]>
Cc: Dmitrii Dolgov <[email protected]>
Cc: Fangrui Song <[email protected]>
Cc: German Gomez <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Ivan Babrou <[email protected]>
Cc: James Clark <[email protected]>
Cc: Jing Zhang <[email protected]>
Cc: Jiri Olsa <[email protected]>
Cc: John Garry <[email protected]>
Cc: K Prateek Nayak <[email protected]>
Cc: Kan Liang <[email protected]>
Cc: Leo Yan <[email protected]>
Cc: Liam Howlett <[email protected]>
Cc: Mark Rutland <[email protected]>
Cc: Miguel Ojeda <[email protected]>
Cc: Mike Leach <[email protected]>
Cc: Namhyung Kim <[email protected]>
Cc: Naveen N. Rao <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Ravi Bangoria <[email protected]>
Cc: Sean Christopherson <[email protected]>
Cc: Steinar H. Gunderson <[email protected]>
Cc: Suzuki Poulouse <[email protected]>
Cc: Wenyu Liu <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: Yang Jihong <[email protected]>
Cc: Ye Xingchen <[email protected]>
Cc: Yuan Can <[email protected]>
Cc: [email protected]
Cc: [email protected]
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Arnaldo Carvalho de Melo <[email protected]>
|
|
equal to a given string
This makes the logic a bit clear by avoiding the !strcmp() pattern and
also a way to intercept the pointer if we need to do extra validation on
it or to do lazy setting of evsel->name via evsel__name(evsel).
Reviewed-by: "Liang, Kan" <[email protected]>
Cc: Adrian Hunter <[email protected]>
Cc: Ian Rogers <[email protected]>
Cc: Jiri Olsa <[email protected]>
Cc: Namhyung Kim <[email protected]>
Link: https://lore.kernel.org/lkml/[email protected]
Signed-off-by: Arnaldo Carvalho de Melo <[email protected]>
|
|
James Clark noticed that the recent 63df0e4bc368adbd ("perf map: Add
accessor for dso") patch accessed map->dso before the 'map' variable was
NULL checked, which is a change in logic that leads to segmentation
faults, so comb thru that patch to fix similar cases.
Fixes: 63df0e4bc368adbd ("perf map: Add accessor for dso")
Acked-by: Ian Rogers <[email protected]>
Cc: Adrian Hunter <[email protected]>
Cc: Alexander Shishkin <[email protected]>
Cc: Ian Rogers <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: James Clark <[email protected]>
Cc: Jiri Olsa <[email protected]>
Cc: John Garry <[email protected]>
Cc: Leo Yan <[email protected]>
Cc: Mark Rutland <[email protected]>
Cc: Mathieu Poirier <[email protected]>
Cc: Mike Leach <[email protected]>
Cc: Namhyung Kim <[email protected]>
Cc: Peter Zijlstra <[email protected]
Cc: Suzuki Poulouse <[email protected]>
Cc: Will Deacon <[email protected]>
Link: https://lore.kernel.org/lkml/[email protected]
Signed-off-by: Arnaldo Carvalho de Melo <[email protected]>
|
|
sort__sym_from_cmp()
Addresses of two data structure members were determined before
corresponding null pointer checks in the implementation of the function
“sort__sym_from_cmp”.
Thus avoid the risk for undefined behaviour by removing extra
initialisations for the local variables “from_l” and “from_r” (also
because they were already reassigned with the same value behind this
pointer check).
This issue was detected by using the Coccinelle software.
Fixes: 1b9e97a2a95e4941 ("perf tools: Fix report -F symbol_from for data without branch info")
Signed-off-by: <[email protected]>
Acked-by: Ian Rogers <[email protected]>
Cc: Adrian Hunter <[email protected]>
Cc: Alexander Shishkin <[email protected]>
Cc: Andi Kleen <[email protected]>
Cc: German Gomez <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Jiri Olsa <[email protected]>
Cc: Kan Liang <[email protected]>
Cc: Mark Rutland <[email protected]>
Cc: Namhyung Kim <[email protected]>
Link: https://lore.kernel.org/cocci/[email protected]/
Signed-off-by: Arnaldo Carvalho de Melo <[email protected]>
|
|
Later changes will add reference count checking for 'struct map'. Add an
accessor so that the reference count check is only necessary in one
place.
Signed-off-by: Ian Rogers <[email protected]>
Cc: Adrian Hunter <[email protected]>
Cc: Alexander Shishkin <[email protected]>
Cc: Alexey Bayduraev <[email protected]>
Cc: Andi Kleen <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Andy Shevchenko <[email protected]>
Cc: Darren Hart <[email protected]>
Cc: Davidlohr Bueso <[email protected]>
Cc: Dmitriy Vyukov <[email protected]>
Cc: Eric Dumazet <[email protected]>
Cc: German Gomez <[email protected]>
Cc: Hao Luo <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: James Clark <[email protected]>
Cc: Jiri Olsa <[email protected]>
Cc: John Garry <[email protected]>
Cc: Kajol Jain <[email protected]>
Cc: Kan Liang <[email protected]>
Cc: Leo Yan <[email protected]>
Cc: Madhavan Srinivasan <[email protected]>
Cc: Mark Rutland <[email protected]>
Cc: Masami Hiramatsu <[email protected]>
Cc: Miaoqian Lin <[email protected]>
Cc: Namhyung Kim <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Riccardo Mancini <[email protected]>
Cc: Shunsuke Nakamura <[email protected]>
Cc: Song Liu <[email protected]>
Cc: Stephane Eranian <[email protected]>
Cc: Stephen Brennan <[email protected]>
Cc: Steven Rostedt (VMware) <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Thomas Richter <[email protected]>
Cc: Yury Norov <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Arnaldo Carvalho de Melo <[email protected]>
|
|
Later changes will add reference count checking for struct map, add a
helper function to invoke the map_ip and unmap_ip function pointers. The
helper allows the reference count check to be in fewer places.
Committer notes:
Add missing conversions to:
tools/perf/util/map.c
tools/perf/util/cs-etm.c
tools/perf/util/annotate.c
tools/perf/arch/powerpc/util/sym-handling.c
tools/perf/arch/s390/annotate/instructions.c
Signed-off-by: Ian Rogers <[email protected]>
Cc: Adrian Hunter <[email protected]>
Cc: Alexander Shishkin <[email protected]>
Cc: Alexey Bayduraev <[email protected]>
Cc: Andi Kleen <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Andy Shevchenko <[email protected]>
Cc: Darren Hart <[email protected]>
Cc: Davidlohr Bueso <[email protected]>
Cc: Dmitriy Vyukov <[email protected]>
Cc: Eric Dumazet <[email protected]>
Cc: German Gomez <[email protected]>
Cc: Hao Luo <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: James Clark <[email protected]>
Cc: Jiri Olsa <[email protected]>
Cc: John Garry <[email protected]>
Cc: Kajol Jain <[email protected]>
Cc: Kan Liang <[email protected]>
Cc: Leo Yan <[email protected]>
Cc: Madhavan Srinivasan <[email protected]>
Cc: Mark Rutland <[email protected]>
Cc: Masami Hiramatsu <[email protected]>
Cc: Miaoqian Lin <[email protected]>
Cc: Namhyung Kim <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Riccardo Mancini <[email protected]>
Cc: Shunsuke Nakamura <[email protected]>
Cc: Song Liu <[email protected]>
Cc: Stephane Eranian <[email protected]>
Cc: Stephen Brennan <[email protected]>
Cc: Steven Rostedt (VMware) <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Thomas Richter <[email protected]>
Cc: Yury Norov <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Arnaldo Carvalho de Melo <[email protected]>
|
|
Later changes will add reference count checking for struct map, with
dso being the most frequently accessed variable. Add an accessor so
that the reference count check is only necessary in one place.
Additional changes:
- add a dso variable to avoid repeated map__dso calls.
- in builtin-mem.c dump_raw_samples, code only partially tested for
dso == NULL. Make the possibility of NULL consistent.
- in thread.c thread__memcpy fix use of spaces and use tabs.
Committer notes:
Did missing conversions on these files:
tools/perf/arch/powerpc/util/skip-callchain-idx.c
tools/perf/arch/powerpc/util/sym-handling.c
tools/perf/ui/browsers/hists.c
tools/perf/ui/gtk/annotate.c
tools/perf/util/cs-etm.c
tools/perf/util/thread.c
tools/perf/util/unwind-libunwind-local.c
tools/perf/util/unwind-libunwind.c
Signed-off-by: Ian Rogers <[email protected]>
Cc: Adrian Hunter <[email protected]>
Cc: Alexander Shishkin <[email protected]>
Cc: Alexey Bayduraev <[email protected]>
Cc: Andi Kleen <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Andy Shevchenko <[email protected]>
Cc: Darren Hart <[email protected]>
Cc: Davidlohr Bueso <[email protected]>
Cc: Dmitriy Vyukov <[email protected]>
Cc: Eric Dumazet <[email protected]>
Cc: German Gomez <[email protected]>
Cc: Hao Luo <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: James Clark <[email protected]>
Cc: Jiri Olsa <[email protected]>
Cc: John Garry <[email protected]>
Cc: Kajol Jain <[email protected]>
Cc: Kan Liang <[email protected]>
Cc: Leo Yan <[email protected]>
Cc: Madhavan Srinivasan <[email protected]>
Cc: Mark Rutland <[email protected]>
Cc: Masami Hiramatsu <[email protected]>
Cc: Miaoqian Lin <[email protected]>
Cc: Namhyung Kim <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Riccardo Mancini <[email protected]>
Cc: Shunsuke Nakamura <[email protected]>
Cc: Song Liu <[email protected]>
Cc: Stephane Eranian <[email protected]>
Cc: Stephen Brennan <[email protected]>
Cc: Steven Rostedt (VMware) <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Thomas Richter <[email protected]>
Cc: Yury Norov <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Arnaldo Carvalho de Melo <[email protected]>
|
|
Introduce functions to access struct maps. These functions reduce the
number of places reference counting is necessary. While tidying APIs do
some small const-ification, in particlar to unwind_libunwind_ops.
Committer notes:
Fixed up tools/perf/util/unwind-libunwind.c:
- return ops->get_entries(cb, arg, thread, data, max_stack);
+ return ops->get_entries(cb, arg, thread, data, max_stack, best_effort);
Signed-off-by: Ian Rogers <[email protected]>
Cc: Adrian Hunter <[email protected]>
Cc: Alexander Shishkin <[email protected]>
Cc: Alexey Bayduraev <[email protected]>
Cc: Andi Kleen <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Andy Shevchenko <[email protected]>
Cc: Darren Hart <[email protected]>
Cc: Davidlohr Bueso <[email protected]>
Cc: Dmitriy Vyukov <[email protected]>
Cc: Eric Dumazet <[email protected]>
Cc: German Gomez <[email protected]>
Cc: Hao Luo <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: James Clark <[email protected]>
Cc: Jiri Olsa <[email protected]>
Cc: John Garry <[email protected]>
Cc: Kajol Jain <[email protected]>
Cc: Kan Liang <[email protected]>
Cc: Leo Yan <[email protected]>
Cc: Madhavan Srinivasan <[email protected]>
Cc: Mark Rutland <[email protected]>
Cc: Masami Hiramatsu <[email protected]>
Cc: Miaoqian Lin <[email protected]>
Cc: Namhyung Kim <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Riccardo Mancini <[email protected]>
Cc: Shunsuke Nakamura <[email protected]>
Cc: Song Liu <[email protected]>
Cc: Stephane Eranian <[email protected]>
Cc: Stephen Brennan <[email protected]>
Cc: Steven Rostedt (VMware) <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Thomas Richter <[email protected]>
Cc: Yury Norov <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Arnaldo Carvalho de Melo <[email protected]>
|
|
The earlier commit f0cdde28fecc0d7f ("perf hist: Improve srcfile sort
key performance") updated the srcfile logic but missed to change the
->cmp() callback which is called for every sample.
It should use the same logic like in the srcline to speed up the
processing because it'd return the same information repeatedly for the
same address. The real processing will be done in
sort__srcfile_collapse().
Fixes: f0cdde28fecc0d7f ("perf hist: Improve srcfile sort key performance")
Signed-off-by: Namhyung Kim <[email protected]>
Cc: Adrian Hunter <[email protected]>
Cc: Ian Rogers <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Jiri Olsa <[email protected]>
Cc: Namhyung Kim <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Arnaldo Carvalho de Melo <[email protected]>
|
|
Add 'simd' sort field to visualize SIMD ops in 'perf report'.
Rows are labeled with the SIMD ISA, and the type of predicate (if any):
- [p] partial predicate
- [e] empty predicate (no elements in the vector being used)
Example with Arm SPE and SVE (Scalable Vector Extension):
#include <arm_sve.h>
double src[1025], dst[1025];
int main(void) {
svfloat64_t vc = svdup_f64(1);
for(;;)
for(int i = 0; i < 1025; i += svcntd())
{
svbool_t pg = svwhilelt_b64(i, 1025);
svfloat64_t vsrc = svld1(pg, &src[i]);
svfloat64_t vdst = svadd_x(pg, vsrc, vc);
svst1(pg, &dst[i], vdst);
}
return 0;
}
... compiled using "gcc-11 -march=armv8-a+sve -O3"
Profiling on a platform that implements FEAT_SVE and FEAT_SPEv1p1:
$ perf record -e arm_spe_0// -- ./a.out
$ perf report --itrace=i1i -s overhead,pid,simd,sym
Overhead Pid:Command Simd Symbol
........ ................ ....... ......................
53.76% 10758:program [.] main
46.14% 10758:program [.] SVE [.] main
0.09% 10758:program [p] SVE [.] main
The report shows 0.09% of the sampled SVE operations use partial
predicates due to src and dst arrays not being multiples of the vector
register lengths.
Signed-off-by: German Gomez <[email protected]>
Acked-by: Ian Rogers <[email protected]>
Cc: Adrian Hunter <[email protected]>
Cc: Alexander Shishkin <[email protected]>
Cc: [email protected]
Cc: Ingo Molnar <[email protected]>
Cc: Jiri Olsa <[email protected]>
Cc: John Garry <[email protected]>
Cc: Leo Yan <[email protected]>
Cc: Mark Rutland <[email protected]>
Cc: Mike Leach <[email protected]>
Cc: Namhyung Kim <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: [email protected]
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: James Clark <[email protected]>
Signed-off-by: Arnaldo Carvalho de Melo <[email protected]>
|
|
Many platforms have feature of adjacent cachelines prefetch, when it is
enabled, for data in RAM of 2 cachelines (2N and 2N+1) granularity, if
one is fetched to cache, the other one could likely be fetched too,
which sort of extends the cacheline size to double, thus the false
sharing could happens in adjacent cachelines.
0Day has captured performance changed related with this [1], and some
commercial software explicitly makes its hot global variables 128 bytes
aligned (2 cache lines) to avoid this kind of extended false sharing.
So add an option "--double-cl" for 'perf c2c report' to show false
sharing in double cache line granularity, which acts just like the
cacheline size is doubled. There is no change to c2c record. The
hardware events of shared cacheline are still per cacheline, and this
option just changes the granularity of how events are grouped and
displayed.
In the 'perf c2c report' output below (will-it-scale's 'pagefault2' case
on old kernel):
----------------------------------------------------------------------
26 31 2 0 0 0 0xffff888103ec6000
----------------------------------------------------------------------
35.48% 50.00% 0.00% 0.00% 0.00% 0x10 0 1 0xffffffff8133148b 1153 66 971 3748 74 [k] get_mem_cgroup_from_mm
6.45% 0.00% 0.00% 0.00% 0.00% 0x10 0 1 0xffffffff813396e4 570 0 1531 879 75 [k] mem_cgroup_charge
25.81% 50.00% 0.00% 0.00% 0.00% 0x54 0 1 0xffffffff81331472 949 70 593 3359 74 [k] get_mem_cgroup_from_mm
19.35% 0.00% 0.00% 0.00% 0.00% 0x54 0 1 0xffffffff81339686 1352 0 1073 1022 74 [k] mem_cgroup_charge
9.68% 0.00% 0.00% 0.00% 0.00% 0x54 0 1 0xffffffff813396d6 1401 0 863 768 74 [k] mem_cgroup_charge
3.23% 0.00% 0.00% 0.00% 0.00% 0x54 0 1 0xffffffff81333106 618 0 804 11 9 [k] uncharge_batch
The offset 0x10 and 0x54 used to displayed in 2 groups, and now they are
listed together to give users a hint of extended false sharing.
[1]. https://lore.kernel.org/lkml/20201102091543.GM31092@shao2-debian/
Committer notes:
Link: https://lore.kernel.org/r/Y+wvVNWqXb70l4uy@feng-clx
Removed -a, leaving just as --double-cl, as this probably is not used so
frequently and perhaps will be even auto-detected if we manage to record
the MSR where this is configured.
Reviewed-by: Andi Kleen <[email protected]>
Reviewed-by: Leo Yan <[email protected]>
Signed-off-by: Feng Tang <[email protected]>
Tested-by: Leo Yan <[email protected]>
Acked-by: Joe Mario <[email protected]>
Cc: Alexander Shishkin <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Jiri Olsa <[email protected]>
Cc: Kan Liang <[email protected]>
Cc: Mark Rutland <[email protected]>
Cc: Namhyung Kim <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Tim Chen <[email protected]>
Cc: Xing Zhengjun <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Arnaldo Carvalho de Melo <[email protected]>
|
|
The Retire Latency field is added in the var3_w of the
PERF_SAMPLE_WEIGHT_STRUCT. The Retire Latency reports pipeline stall of
this instruction compared to the previous instruction in cycles. That's
quite useful to display the information with perf mem report.
The p_stage_cyc for Power is also from the var3_w. Union the p_stage_cyc
and retire_lat to share the code.
Implement X86 specific codes to display the X86 specific header.
Add a new sort key retire_lat for the Retire Latency.
Reviewed-by: Andi Kleen <[email protected]>
Signed-off-by: Kan Liang <[email protected]>
Cc: Ian Rogers <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Stephane Eranian <[email protected]>
Link: http://lore.kernel.org/lkml/[email protected]
Signed-off-by: Arnaldo Carvalho de Melo <[email protected]>
|
|
Add a helper function that applies the mask to test, or returns false
if libtraceevent is too old or not present.
Signed-off-by: Ian Rogers <[email protected]>
Cc: Adrian Hunter <[email protected]>
Cc: Alexander Shishkin <[email protected]>
Cc: Athira Jajeev <[email protected]>
Cc: Eelco Chaudron <[email protected]>
Cc: German Gomez <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Jiri Olsa <[email protected]>
Cc: Kan Liang <[email protected]>
Cc: Kim Phillips <[email protected]>
Cc: Leo Yan <[email protected]>
Cc: Mark Rutland <[email protected]>
Cc: Namhyung Kim <[email protected]>
Cc: Paolo Bonzini <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Sean Christopherson <[email protected]>
Cc: Stephane Eranian <[email protected]>
Cc: Yang Jihong <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Arnaldo Carvalho de Melo <[email protected]>
|
|
Switch HAVE_LIBTRACEEVENT_TEP_FIELD_IS_RELATIVE to be a version number
test on libtraceevent being >= to version 1.5.0. This also corrects a
greater-than test to be greater-than-or-equal.
Fixes: b9a49f8cb02f0859 ("perf tools: Check if libtracevent has TEP_FIELD_IS_RELATIVE")
Signed-off-by: Ian Rogers <[email protected]>
Cc: Adrian Hunter <[email protected]>
Cc: Alexander Shishkin <[email protected]>
Cc: Athira Jajeev <[email protected]>
Cc: Eelco Chaudron <[email protected]>
Cc: German Gomez <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Jiri Olsa <[email protected]>
Cc: Kan Liang <[email protected]>
Cc: Kim Phillips <[email protected]>
Cc: Leo Yan <[email protected]>
Cc: Mark Rutland <[email protected]>
Cc: Namhyung Kim <[email protected]>
Cc: Paolo Bonzini <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Sean Christopherson <[email protected]>
Cc: Stephane Eranian <[email protected]>
Cc: Yang Jihong <[email protected]>
Link: https://lore.kernel.org/lkml/[email protected]/
Signed-off-by: Arnaldo Carvalho de Melo <[email protected]>
|
|
not linked with libtraceevent
When we have a perf.data file with tracepoints, such as:
# perf evlist -f
probe_perf:lzma_decompress_to_file
# Tip: use 'perf evlist --trace-fields' to show fields for tracepoint events
#
We end up segfaulting when using perf built with NO_LIBTRACEEVENT=1 by
trying to find an evsel with a NULL 'event_name' variable:
(gdb) run report --stdio -f
Starting program: /root/bin/perf report --stdio -f
Program received signal SIGSEGV, Segmentation fault.
0x000000000055219d in find_evsel (evlist=0xfda7b0, event_name=0x0) at util/sort.c:2830
warning: Source file is more recent than executable.
2830 if (event_name[0] == '%') {
Missing separate debuginfos, use: dnf debuginfo-install bzip2-libs-1.0.8-11.fc36.x86_64 cyrus-sasl-lib-2.1.27-18.fc36.x86_64 elfutils-debuginfod-client-0.188-3.fc36.x86_64 elfutils-libelf-0.188-3.fc36.x86_64 elfutils-libs-0.188-3.fc36.x86_64 glibc-2.35-20.fc36.x86_64 keyutils-libs-1.6.1-4.fc36.x86_64 krb5-libs-1.19.2-12.fc36.x86_64 libbrotli-1.0.9-7.fc36.x86_64 libcap-2.48-4.fc36.x86_64 libcom_err-1.46.5-2.fc36.x86_64 libcurl-7.82.0-12.fc36.x86_64 libevent-2.1.12-6.fc36.x86_64 libgcc-12.2.1-4.fc36.x86_64 libidn2-2.3.4-1.fc36.x86_64 libnghttp2-1.51.0-1.fc36.x86_64 libpsl-0.21.1-5.fc36.x86_64 libselinux-3.3-4.fc36.x86_64 libssh-0.9.6-4.fc36.x86_64 libstdc++-12.2.1-4.fc36.x86_64 libunistring-1.0-1.fc36.x86_64 libunwind-1.6.2-2.fc36.x86_64 libxcrypt-4.4.33-4.fc36.x86_64 libzstd-1.5.2-2.fc36.x86_64 numactl-libs-2.0.14-5.fc36.x86_64 opencsd-1.2.0-1.fc36.x86_64 openldap-2.6.3-1.fc36.x86_64 openssl-libs-3.0.5-2.fc36.x86_64 slang-2.3.2-11.fc36.x86_64 xz-libs-5.2.5-9.fc36.x86_64 zlib-1.2.11-33.fc36.x86_64
(gdb) bt
#0 0x000000000055219d in find_evsel (evlist=0xfda7b0, event_name=0x0) at util/sort.c:2830
#1 0x0000000000552416 in add_dynamic_entry (evlist=0xfda7b0, tok=0xffb6eb "trace", level=2) at util/sort.c:2976
#2 0x0000000000552d26 in sort_dimension__add (list=0xf93e00 <perf_hpp_list>, tok=0xffb6eb "trace", evlist=0xfda7b0, level=2) at util/sort.c:3193
#3 0x0000000000552e1c in setup_sort_list (list=0xf93e00 <perf_hpp_list>, str=0xffb6eb "trace", evlist=0xfda7b0) at util/sort.c:3227
#4 0x00000000005532fa in __setup_sorting (evlist=0xfda7b0) at util/sort.c:3381
#5 0x0000000000553cdc in setup_sorting (evlist=0xfda7b0) at util/sort.c:3608
#6 0x000000000042eb9f in cmd_report (argc=0, argv=0x7fffffffe470) at builtin-report.c:1596
#7 0x00000000004aee7e in run_builtin (p=0xf64ca0 <commands+288>, argc=3, argv=0x7fffffffe470) at perf.c:330
#8 0x00000000004af0f2 in handle_internal_command (argc=3, argv=0x7fffffffe470) at perf.c:384
#9 0x00000000004af241 in run_argv (argcp=0x7fffffffe29c, argv=0x7fffffffe290) at perf.c:428
#10 0x00000000004af5fc in main (argc=3, argv=0x7fffffffe470) at perf.c:562
(gdb)
So check if we have tracepoint events in add_dynamic_entry() and bail
out instead:
# perf report --stdio -f
This perf binary isn't linked with libtraceevent, can't process probe_perf:lzma_decompress_to_file
Error:
Unknown --sort key: `trace'
#
Fixes: 378ef0f5d9d7f465 ("perf build: Use libtraceevent from the system")
Acked-by: Ian Rogers <[email protected]>
Cc: Adrian Hunter <[email protected]>
Cc: Jiri Olsa <[email protected]>
Cc: Namhyung Kim <[email protected]>
Link: http://lore.kernel.org/lkml/[email protected]
Signed-off-by: Arnaldo Carvalho de Melo <[email protected]>
|
|
Likewise, modify ->cmp() callback to compare sample address and map
address. And add ->collapse() and ->sort() to check the actual
srcfile string. Also add ->init() to make sure it has the srcfile.
Signed-off-by: Namhyung Kim <[email protected]>
Acked-by: Ian Rogers <[email protected]>
Cc: Adrian Hunter <[email protected]>
Cc: Andi Kleen <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Jiri Olsa <[email protected]>
Cc: Leo Yan <[email protected]>
Cc: Milian Wolff <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Arnaldo Carvalho de Melo <[email protected]>
|
|
Likewise, modify ->cmp() callback to compare sample address and map
address. And add ->collapse() and ->sort() to check the actual
srcfile string. Also add ->init() to make sure it has the srcfile.
Signed-off-by: Namhyung Kim <[email protected]>
Acked-by: Ian Rogers <[email protected]>
Cc: Adrian Hunter <[email protected]>
Cc: Andi Kleen <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Jiri Olsa <[email protected]>
Cc: Leo Yan <[email protected]>
Cc: Milian Wolff <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Arnaldo Carvalho de Melo <[email protected]>
|
|
The sort_entry->cmp() will be called for eventy sample data to find a
matching entry. When it has 'srcline' sort key, that means it needs to
call addr2line or libbfd everytime.
This is not optimal because many samples will have same address and it
just can call addr2line once. So postpone the actual srcline check to
the sort_entry->collpase() and compare addresses in ->cmp().
Also it needs to add ->init() callback to make sure it has srcline info.
If a sample has a unique data, chances are the entry can be sorted out
by other (previous) keys and callbacks in sort_srcline never called.
Signed-off-by: Namhyung Kim <[email protected]>
Acked-by: Ian Rogers <[email protected]>
Cc: Adrian Hunter <[email protected]>
Cc: Andi Kleen <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Jiri Olsa <[email protected]>
Cc: Leo Yan <[email protected]>
Cc: Milian Wolff <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Arnaldo Carvalho de Melo <[email protected]>
|
|
In __hists__insert_output_entry(), it calls fmt->sort() for dynamic
entries with NULL to update column width for tracepoint fields.
But it's a hacky abuse of the sort callback, better to have a proper
callback for that. I'll add more use cases later.
Signed-off-by: Namhyung Kim <[email protected]>
Acked-by: Ian Rogers <[email protected]>
Cc: Adrian Hunter <[email protected]>
Cc: Andi Kleen <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Jiri Olsa <[email protected]>
Cc: Leo Yan <[email protected]>
Cc: Milian Wolff <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Arnaldo Carvalho de Melo <[email protected]>
|
|
Some distros have older versions of libtraceevent where
TEP_FIELD_IS_RELATIVE and its associated semantics are not present, so
we need to check if the version has it, it was introduced in
libtraceevent 1.5.0.
Reported-by: Athira Jajeev <[email protected]>
Tested-by: Athira Jajeev <[email protected]>
Cc: Adrian Hunter <[email protected]>
Cc: Alexander Shishkin <[email protected]>
Cc: Ian Rogers <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Jiri Olsa <[email protected]>
Cc: Mark Rutland <[email protected]>
Cc: Namhyung Kim <[email protected]>
Cc: Nick Desaulniers <[email protected]>
Cc: Peter Zijlstra <[email protected]>,
Cc: Stephane Eranian <[email protected]>
Signed-off-by: Arnaldo Carvalho de Melo <[email protected]>
|
|
Remove the LIBTRACEEVENT_DYNAMIC and LIBTRACEFS_DYNAMIC make command
line variables.
If libtraceevent isn't installed or NO_LIBTRACEEVENT=1 is passed to the
build, don't compile in libtraceevent and libtracefs support.
This also disables CONFIG_TRACE that controls "perf trace".
CONFIG_LIBTRACEEVENT is used to control enablement in Build/Makefiles,
HAVE_LIBTRACEEVENT is used in C code.
Without HAVE_LIBTRACEEVENT tracepoints are disabled and as such the
commands kmem, kwork, lock, sched and timechart are removed. The
majority of commands continue to work including "perf test".
Committer notes:
Fixed up a tools/perf/util/Build reject and added:
#include <traceevent/event-parse.h>
to tools/perf/util/scripting-engines/trace-event-perl.c.
Committer testing:
$ rpm -qi libtraceevent-devel
Name : libtraceevent-devel
Version : 1.5.3
Release : 2.fc36
Architecture: x86_64
Install Date: Mon 25 Jul 2022 03:20:19 PM -03
Group : Unspecified
Size : 27728
License : LGPLv2+ and GPLv2+
Signature : RSA/SHA256, Fri 15 Apr 2022 02:11:58 PM -03, Key ID 999f7cbf38ab71f4
Source RPM : libtraceevent-1.5.3-2.fc36.src.rpm
Build Date : Fri 15 Apr 2022 10:57:01 AM -03
Build Host : buildvm-x86-05.iad2.fedoraproject.org
Packager : Fedora Project
Vendor : Fedora Project
URL : https://git.kernel.org/pub/scm/libs/libtrace/libtraceevent.git/
Bug URL : https://bugz.fedoraproject.org/libtraceevent
Summary : Development headers of libtraceevent
Description :
Development headers of libtraceevent-libs
$
Default build:
$ ldd ~/bin/perf | grep tracee
libtraceevent.so.1 => /lib64/libtraceevent.so.1 (0x00007f1dcaf8f000)
$
# perf trace -e sched:* --max-events 10
0.000 migration/0/17 sched:sched_migrate_task(comm: "", pid: 1603763 (perf), prio: 120, dest_cpu: 1)
0.005 migration/0/17 sched:sched_wake_idle_without_ipi(cpu: 1)
0.011 migration/0/17 sched:sched_switch(prev_comm: "", prev_pid: 17 (migration/0), prev_state: 1, next_comm: "", next_prio: 120)
1.173 :0/0 sched:sched_wakeup(comm: "", pid: 3138 (gnome-terminal-), prio: 120)
1.180 :0/0 sched:sched_switch(prev_comm: "", prev_prio: 120, next_comm: "", next_pid: 3138 (gnome-terminal-), next_prio: 120)
0.156 migration/1/21 sched:sched_migrate_task(comm: "", pid: 1603763 (perf), prio: 120, orig_cpu: 1, dest_cpu: 2)
0.160 migration/1/21 sched:sched_wake_idle_without_ipi(cpu: 2)
0.166 migration/1/21 sched:sched_switch(prev_comm: "", prev_pid: 21 (migration/1), prev_state: 1, next_comm: "", next_prio: 120)
1.183 :0/0 sched:sched_wakeup(comm: "", pid: 1602985 (kworker/u16:0-f), prio: 120, target_cpu: 1)
1.186 :0/0 sched:sched_switch(prev_comm: "", prev_prio: 120, next_comm: "", next_pid: 1602985 (kworker/u16:0-f), next_prio: 120)
#
Had to tweak tools/perf/util/setup.py to make sure the python binding
shared object links with libtraceevent if -DHAVE_LIBTRACEEVENT is
present in CFLAGS.
Building with NO_LIBTRACEEVENT=1 uncovered some more build failures:
- Make building of data-convert-bt.c to CONFIG_LIBTRACEEVENT=y
- perf-$(CONFIG_LIBTRACEEVENT) += scripts/
- bpf_kwork.o needs also to be dependent on CONFIG_LIBTRACEEVENT=y
- The python binding needed some fixups and util/trace-event.c can't be
built and linked with the python binding shared object, so remove it
in tools/perf/util/setup.py and exclude it from the list of
dependencies in the python/perf.so Makefile.perf target.
Building without libtraceevent-devel installed uncovered more build
failures:
- The python binding tools/perf/util/python.c was assuming that
traceevent/parse-events.h was always available, which was the case
when we defaulted to using the in-kernel tools/lib/traceevent/ files,
now we need to enclose it under ifdef HAVE_LIBTRACEEVENT, just like
the other parts of it that deal with tracepoints.
- We have to ifdef the rules in the Build files with
CONFIG_LIBTRACEEVENT=y to build builtin-trace.c and
tools/perf/trace/beauty/ as we only ifdef setting CONFIG_TRACE=y when
setting NO_LIBTRACEEVENT=1 in the make command line, not when we don't
detect libtraceevent-devel installed in the system. Simplification here
to avoid these two ways of disabling builtin-trace.c and not having
CONFIG_TRACE=y when libtraceevent-devel isn't installed is the clean
way.
From Athira:
<quote>
tools/perf/arch/powerpc/util/Build
-perf-y += kvm-stat.o
+perf-$(CONFIG_LIBTRACEEVENT) += kvm-stat.o
</quote>
Then, ditto for arm64 and s390, detected by container cross build tests.
- s/390 uses test__checkevent_tracepoint() that is now only available if
HAVE_LIBTRACEEVENT is defined, enclose the callsite with ifder HAVE_LIBTRACEEVENT.
Also from Athira:
<quote>
With this change, I could successfully compile in these environment:
- Without libtraceevent-devel installed
- With libtraceevent-devel installed
- With “make NO_LIBTRACEEVENT=1”
</quote>
Then, finally rename CONFIG_TRACEEVENT to CONFIG_LIBTRACEEVENT for
consistency with other libraries detected in tools/perf/.
Signed-off-by: Ian Rogers <[email protected]>
Tested-by: Arnaldo Carvalho de Melo <[email protected]>
Tested-by: Athira Rajeev <[email protected]>
Cc: Alexander Shishkin <[email protected]>
Cc: Jiri Olsa <[email protected]>
Cc: Mark Rutland <[email protected]>
Cc: Namhyung Kim <[email protected]>
Cc: Nick Desaulniers <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Stephane Eranian <[email protected]>
Cc: [email protected]
Link: http://lore.kernel.org/lkml/[email protected]
Signed-off-by: Arnaldo Carvalho de Melo <[email protected]>
|
|
Sometimes users want to see actual (virtual) address of sampled instructions.
Add a new 'addr' sort key to display the raw addresses.
$ perf record -o- true | perf report -i- -s addr
# To display the perf.data header info, please use --header/--header-only options.
#
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.000 MB - ]
#
# Total Lost Samples: 0
#
# Samples: 12 of event 'cycles:u'
# Event count (approx.): 252512
#
# Overhead Address
# ........ ..................
#
42.96% 0x7f96f08443d7
29.55% 0x7f96f0859b50
14.76% 0x7f96f0852e02
8.30% 0x7f96f0855028
4.43% 0xffffffff8de01087
Note that it just compares and displays the sample ip. Each process can
have a different memory layout and the ip will be different even if they run
the same binary. So this sort key is mostly meaningful for per-process
profile data.
Signed-off-by: Namhyung Kim <[email protected]>
Acked-by: Ian Rogers <[email protected]>
Cc: Adrian Hunter <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Jiri Olsa <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Stephane Eranian <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Arnaldo Carvalho de Melo <[email protected]>
|
|
With the existing symbol_from/symbol_to, branches captured in the same
function would be collapsed into a single function if the latencies
associated with the each branch (cycles) were all the same. That is the
case on Intel Broadwell, for instance. Since Intel Skylake, the latency
is captured by hardware and therefore is used to disambiguate branches.
Add addr_from/addr_to sort dimensions to sort branches based on their
addresses and not the function there are in. The output is still the
function name but the offset within the function is provided to uniquely
identify each branch. These new sort dimensions also help with annotate
because they create different entries in the histogram which, in turn,
generates proper branch annotations.
Here is an example using AMD's branch sampling:
$ perf record -a -b -c 1000037 -e cpu/branch-brs/ test_prg
$ perf report
Samples: 6M of event 'cpu/branch-brs/', Event count (approx.): 6901276
Overhead Command Source Shared Object Source Symbol Target Symbol Basic Block Cycle
99.65% test_prg test_prg [.] test_thread [.] test_thread -
0.02% test_prg [kernel.vmlinux] [k] asm_sysvec_apic_timer_interrupt [k] error_entry -
$ perf report -F overhead,comm,dso,addr_from,addr_to
Samples: 6M of event 'cpu/branch-brs/', Event count (approx.): 6901276
Overhead Command Shared Object Source Address Target Address
4.22% test_prg test_prg [.] test_thread+0x3c [.] test_thread+0x4
4.13% test_prg test_prg [.] test_thread+0x4 [.] test_thread+0x3a
4.09% test_prg test_prg [.] test_thread+0x3a [.] test_thread+0x6
4.08% test_prg test_prg [.] test_thread+0x2 [.] test_thread+0x3c
4.06% test_prg test_prg [.] test_thread+0x3e [.] test_thread+0x2
3.87% test_prg test_prg [.] test_thread+0x6 [.] test_thread+0x38
3.84% test_prg test_prg [.] test_thread [.] test_thread+0x3e
3.76% test_prg test_prg [.] test_thread+0x1e [.] test_thread
3.76% test_prg test_prg [.] test_thread+0x38 [.] test_thread+0x8
3.56% test_prg test_prg [.] test_thread+0x22 [.] test_thread+0x1e
3.54% test_prg test_prg [.] test_thread+0x8 [.] test_thread+0x36
3.47% test_prg test_prg [.] test_thread+0x1c [.] test_thread+0x22
3.45% test_prg test_prg [.] test_thread+0x36 [.] test_thread+0xa
3.28% test_prg test_prg [.] test_thread+0x24 [.] test_thread+0x1c
3.25% test_prg test_prg [.] test_thread+0xa [.] test_thread+0x34
3.24% test_prg test_prg [.] test_thread+0x1a [.] test_thread+0x24
3.20% test_prg test_prg [.] test_thread+0x34 [.] test_thread+0xc
3.04% test_prg test_prg [.] test_thread+0x26 [.] test_thread+0x1a
3.01% test_prg test_prg [.] test_thread+0xc [.] test_thread+0x32
2.98% test_prg test_prg [.] test_thread+0x18 [.] test_thread+0x26
2.94% test_prg test_prg [.] test_thread+0x32 [.] test_thread+0xe
2.76% test_prg test_prg [.] test_thread+0x28 [.] test_thread+0x18
2.73% test_prg test_prg [.] test_thread+0xe [.] test_thread+0x30
2.67% test_prg test_prg [.] test_thread+0x30 [.] test_thread+0x10
2.67% test_prg test_prg [.] test_thread+0x16 [.] test_thread+0x28
2.46% test_prg test_prg [.] test_thread+0x10 [.] test_thread+0x2e
2.44% test_prg test_prg [.] test_thread+0x2a [.] test_thread+0x16
2.38% test_prg test_prg [.] test_thread+0x14 [.] test_thread+0x2a
2.32% test_prg test_prg [.] test_thread+0x2e [.] test_thread+0x12
2.28% test_prg test_prg [.] test_thread+0x12 [.] test_thread+0x2c
2.16% test_prg test_prg [.] test_thread+0x2c [.] test_thread+0x14
0.02% test_prg [kernel.vmlinux] [k] asm_sysvec_apic_ti+0x5 [k] error_entry
Signed-off-by: Stephane Eranian <[email protected]>
Cc: Jiri Olsa <[email protected]>
Cc: Kim Phillips <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Song Liu <[email protected]>
Link: http://lore.kernel.org/lkml/[email protected]
Signed-off-by: Arnaldo Carvalho de Melo <[email protected]>
|
|
In branch mode, the branch symbols were being displayed with incorrect
cpumode labels. So fix this.
For example, before:
# perf record -b -a -- sleep 1
# perf report -b
Overhead Command Source Shared Object Source Symbol Target Symbol
0.08% swapper [kernel.kallsyms] [k] rcu_idle_enter [k] cpuidle_enter_state
==> 0.08% cmd0 [kernel.kallsyms] [.] psi_group_change [.] psi_group_change
0.08% cmd1 [kernel.kallsyms] [k] psi_group_change [k] psi_group_change
After:
# perf report -b
Overhead Command Source Shared Object Source Symbol Target Symbol
0.08% swapper [kernel.kallsyms] [k] rcu_idle_enter [k] cpuidle_enter_state
0.08% cmd0 [kernel.kallsyms] [k] psi_group_change [k] pei_group_change
0.08% cmd1 [kernel.kallsyms] [k] psi_group_change [k] psi_group_change
Reviewed-by: James Clark <[email protected]>
Signed-off-by: German Gomez <[email protected]>
Cc: Alexander Shishkin <[email protected]>
Cc: Jiri Olsa <[email protected]>
Cc: Mark Rutland <[email protected]>
Cc: Namhyung Kim <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Arnaldo Carvalho de Melo <[email protected]>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/acme/linux
Pull perf tool updates from Arnaldo Carvalho de Melo:
"New features:
- Add 'trace' subcommand for 'perf ftrace', setting the stage for
more 'perf ftrace' subcommands. Not using a subcommand yields the
previous behaviour of 'perf ftrace'.
- Add 'latency' subcommand to 'perf ftrace', that can use the
function graph tracer or a BPF optimized one, via the -b/--use-bpf
option.
E.g.:
$ sudo perf ftrace latency -a -T mutex_lock sleep 1
# DURATION | COUNT | GRAPH |
0 - 1 us | 4596 | ######################## |
1 - 2 us | 1680 | ######### |
2 - 4 us | 1106 | ##### |
4 - 8 us | 546 | ## |
8 - 16 us | 562 | ### |
16 - 32 us | 1 | |
32 - 64 us | 0 | |
64 - 128 us | 0 | |
128 - 256 us | 0 | |
256 - 512 us | 0 | |
512 - 1024 us | 0 | |
1 - 2 ms | 0 | |
2 - 4 ms | 0 | |
4 - 8 ms | 0 | |
8 - 16 ms | 0 | |
16 - 32 ms | 0 | |
32 - 64 ms | 0 | |
64 - 128 ms | 0 | |
128 - 256 ms | 0 | |
256 - 512 ms | 0 | |
512 - 1024 ms | 0 | |
1 - ... s | 0 | |
The original implementation of this command was in the bcc tool.
- Support --cputype option for hybrid events in 'perf stat'.
Improvements:
- Call chain improvements for ARM64.
- No need to do any affinity setup when profiling pids.
- Reduce multiplexing with duration_time in 'perf stat' metrics.
- Improve error message for uncore events, stating that some event
groups are can only be used in system wide (-a) mode.
- perf stat metric group leader fixes/improvements, including arch
specific changes to better support Intel topdown events.
- Probe non-deprecated sysfs path first, i.e. try the path
/sys/devices/system/cpu/cpuN/topology/thread_siblings first, then
the old /sys/devices/system/cpu/cpuN/topology/core_cpus.
- Disable debuginfod by default in 'perf record', to avoid stalls on
distros such as Fedora 35.
- Use unbuffered output in 'perf bench' when pipe/tee'ing to a file.
- Enable ignore_missing_thread in 'perf trace'
Fixes:
- Avoid TUI crash when navigating in the annotation of recursive
functions.
- Fix hex dump character output in 'perf script'.
- Fix JSON indentation to 4 spaces standard in the ARM vendor event
files.
- Fix use after free in metric__new().
- Fix IS_ERR_OR_NULL() usage in the perf BPF loader.
- Fix up cross-arch register support, i.e. when printing register
names take into account the architecture where the perf.data file
was collected.
- Fix SMT fallback with large core counts.
- Don't lower case MetricExpr when parsing JSON files so as not to
lose info such as the ":G" event modifier in metrics.
perf test:
- Add basic stress test for sigtrap handling to 'perf test'.
- Fix 'perf test' failures on s/390
- Enable system wide for metricgroups test in 'perf test´.
- Use 3 digits for test numbering now we can have more tests.
Arch specific:
- Add events for Arm Neoverse N2 in the ARM JSON vendor event files
- Support PERF_MEM_LVLNUM encodings in powerpc, that came from a
single patch series, where I incorrectly merged the kernel bits,
that were then reverted after coordination with Michael Ellerman
and Stephen Rothwell.
- Add ARM SPE total latency as PERF_SAMPLE_WEIGHT.
- Update AMD documentation, with info on raw event encoding.
- Add support for global and local variants of the "p_stage_cyc" sort
key, applicable to perf.data files collected on powerpc.
- Remove duplicate and incorrect aux size checks in the ARM CoreSight
ETM code.
Refactorings:
- Add a perf_cpu abstraction to disambiguate CPUs and CPU map
indexes, fixing problems along the way.
- Document CPU map methods.
UAPI sync:
- Update arch/x86/lib/mem{cpy,set}_64.S copies used in 'perf bench
mem memcpy'
- Sync UAPI files with the kernel sources: drm, msr-index,
cpufeatures.
Build system
- Enable warnings through HOSTCFLAGS.
- Drop requirement for libstdc++.so for libopencsd check
libperf:
- Make libperf adopt perf_counts_values__scale() from tools/perf/util/.
- Add a stat multiplexing test to libperf"
* tag 'perf-tools-for-v5.17-2022-01-16' of git://git.kernel.org/pub/scm/linux/kernel/git/acme/linux: (115 commits)
perf record: Disable debuginfod by default
perf evlist: No need to do any affinity setup when profiling pids
perf cpumap: Add is_dummy() method
perf metric: Fix metric_leader
perf cputopo: Fix CPU topology reading on s/390
perf metricgroup: Fix use after free in metric__new()
libperf tests: Update a use of the new cpumap API
perf arm: Fix off-by-one directory path
tools arch x86: Sync the msr-index.h copy with the kernel sources
tools headers cpufeatures: Sync with the kernel sources
tools headers UAPI: Update tools's copy of drm.h header
tools arch: Update arch/x86/lib/mem{cpy,set}_64.S copies used in 'perf bench mem memcpy'
perf pmu-events: Don't lower case MetricExpr
perf expr: Add debug logging for literals
perf tools: Probe non-deprecated sysfs path 1st
perf tools: Fix SMT fallback with large core counts
perf cpumap: Give CPUs their own type
perf stat: Correct first_shadow_cpu to return index
perf script: Fix flipped index and cpu
perf c2c: Use more intention revealing iterator
...
|
|
Sort key 'p_stage_cyc' is used to present the latency cycles spent in
pipeline stages.
perf has local 'p_stage_cyc' sort key to display this info. There is no
global variant available for this sort key. The local variant shows
latency in a single sample, whereas the global value will be useful to
present the total latency (sum of latencies) in the hist entry. It
represents the latency number multiplied by the number of samples.
Add global ('p_stage_cyc') and local variant ('local_p_stage_cyc') for
this sort key. Use 'local_p_stage_cyc' as default option for "mem" sort
mode.
Also add this to the list of dynamic sort keys and made the
"dynamic_headers" and "arch_specific_sort_keys" as static.
Reported-by: Namhyung Kim <[email protected]>
Signed-off-by: Athira Jajeev <[email protected]>
Tested-by: Nageswara R Sastry <[email protected]>
Cc: Jiri Olsa <[email protected]>
Cc: Kajol Jain <[email protected]>
Cc: Madhavan Srinivasan <[email protected]>
Cc: Michael Ellerman <[email protected]>
Cc: [email protected]
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Arnaldo Carvalho de Melo <[email protected]>
|
|
Add new '__rel_loc' dynamic data location attribute support.
This type attribute is similar to the '__data_loc' but records the
offset from the field itself.
The libtraceevent adds TEP_FIELD_IS_RELATIVE to the
'tep_format_field::flags' with TEP_FIELD_IS_DYNAMIC for'__rel_loc'.
Link: https://lkml.kernel.org/r/163757344810.510314.12449413842136229871.stgit@devnote2
Cc: Beau Belgrave <[email protected]>
Cc: Namhyung Kim <[email protected]>
Cc: Tom Zanussi <[email protected]>
Signed-off-by: Masami Hiramatsu <[email protected]>
Signed-off-by: Steven Rostedt (VMware) <[email protected]>
|
|
andle 'p_stage_cyc' (for pipeline stage cycles) sort key with the same
rationale as for the 'weight' and 'local_weight', see the fix in this
series for a full explanation.
Not sure it also needs the local and global variants.
But I couldn't test it actually because I don't have the machine.
Reviewed-by: Athira Jajeev <[email protected]>
Signed-off-by: Namhyung Kim <[email protected]>
Tested-by: Athira Jajeev <[email protected]>
Cc: Andi Kleen <[email protected]>
Cc: Athira Jajeev <[email protected]>
Cc: Ian Rogers <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Jiri Olsa <[email protected]>
Cc: Kan Liang <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Stephane Eranian <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Arnaldo Carvalho de Melo <[email protected]>
|
|
Handle 'ins_lat' (for instruction latency) and 'local_ins_lat' sort keys
with the same rationale as for the 'weight' and 'local_weight', see the
previous fix in this series for a full explanation.
But I couldn't test it actually, so only build tested.
Reviewed-by: Athira Jajeev <[email protected]>
Signed-off-by: Namhyung Kim <[email protected]>
Tested-by: Athira Jajeev <[email protected]>
Cc: Andi Kleen <[email protected]>
Cc: Athira Jajeev <[email protected]>
Cc: Ian Rogers <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Jiri Olsa <[email protected]>
Cc: Kan Liang <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Stephane Eranian <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Arnaldo Carvalho de Melo <[email protected]>
|
|
Currently, the 'weight' field in the perf sample has latency information
for some instructions like in memory accesses. And perf tool has 'weight'
and 'local_weight' sort keys to display the info.
But it's somewhat confusing what it shows exactly. In my understanding,
'local_weight' shows a weight in a single sample, and (global) 'weight'
shows a sum of the weights in the hist_entry.
For example:
$ perf mem record -t load dd if=/dev/zero of=/dev/null bs=4k count=1M
$ perf report --stdio -n -s +local_weight
...
#
# Overhead Samples Command Shared Object Symbol Local Weight
# ........ ....... ....... ................ ......................... ............
#
21.23% 313 dd [kernel.vmlinux] [k] lockref_get_not_zero 32
12.43% 183 dd [kernel.vmlinux] [k] lockref_get_not_zero 35
11.97% 159 dd [kernel.vmlinux] [k] lockref_get_not_zero 36
10.40% 141 dd [kernel.vmlinux] [k] lockref_put_return 32
7.63% 113 dd [kernel.vmlinux] [k] lockref_get_not_zero 33
6.37% 92 dd [kernel.vmlinux] [k] lockref_get_not_zero 34
6.15% 90 dd [kernel.vmlinux] [k] lockref_put_return 33
...
So let's look at the 'lockref_get_not_zero' symbols. The top entry
shows that 313 samples were captured with 'local_weight' 32, so the
total weight should be 313 x 32 = 10016. But it's not the case:
$ perf report --stdio -n -s +local_weight,weight -S lockref_get_not_zero
...
#
# Overhead Samples Command Shared Object Local Weight Weight
# ........ ....... ....... ................ ............ ......
#
1.36% 4 dd [kernel.vmlinux] 36 144
0.47% 4 dd [kernel.vmlinux] 37 148
0.42% 4 dd [kernel.vmlinux] 32 128
0.40% 4 dd [kernel.vmlinux] 34 136
0.35% 4 dd [kernel.vmlinux] 36 144
0.34% 4 dd [kernel.vmlinux] 35 140
0.30% 4 dd [kernel.vmlinux] 36 144
0.30% 4 dd [kernel.vmlinux] 34 136
0.30% 4 dd [kernel.vmlinux] 32 128
0.30% 4 dd [kernel.vmlinux] 32 128
...
With the 'weight' sort key, it's divided to 4 samples even with the same
info ('comm', 'dso', 'sym' and 'local_weight'). I don't think this is
what we want.
I found this because of the way it aggregates the 'weight' value. Since
it's not a period, we should not add them in the he->stat. Otherwise,
two 32 'weight' entries will create a 64 'weight' entry.
After that, new 32 'weight' samples don't have a matching entry so it'd
create a new entry and make it a 64 'weight' entry again and again.
Later, they will be merged into 128 'weight' entries during the
hists__collapse_resort() with 4 samples, multiple times like above.
Let's keep the weight and display it differently. For 'local_weight',
it can show the weight as is, and for (global) 'weight' it can display
the number multiplied by the number of samples.
With this change, I can see the expected numbers.
$ perf report --stdio -n -s +local_weight,weight -S lockref_get_not_zero
...
#
# Overhead Samples Command Shared Object Local Weight Weight
# ........ ....... ....... ................ ............ .....
#
21.23% 313 dd [kernel.vmlinux] 32 10016
12.43% 183 dd [kernel.vmlinux] 35 6405
11.97% 159 dd [kernel.vmlinux] 36 5724
7.63% 113 dd [kernel.vmlinux] 33 3729
6.37% 92 dd [kernel.vmlinux] 34 3128
4.17% 59 dd [kernel.vmlinux] 37 2183
0.08% 1 dd [kernel.vmlinux] 269 269
0.08% 1 dd [kernel.vmlinux] 38 38
Reviewed-by: Athira Jajeev <[email protected]>
Signed-off-by: Namhyung Kim <[email protected]>
Tested-by: Athira Jajeev <[email protected]>
Cc: Andi Kleen <[email protected]>
Cc: Ian Rogers <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Jiri Olsa <[email protected]>
Cc: Kan Liang <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Stephane Eranian <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Arnaldo Carvalho de Melo <[email protected]>
|
|
ASan reports the memory leak of the strings allocated by sort_help() when
running perf report.
This patch changes the returned pointer to char* (instead of const
char*), saves it in a temporary variable, and finally deallocates it at
function exit.
Signed-off-by: Riccardo Mancini <[email protected]>
Fixes: 702fb9b415e7c99b ("perf report: Show all sort keys in help output")
Cc: Andi Kleen <[email protected]>
Cc: Ian Rogers <[email protected]>
Cc: Jiri Olsa <[email protected]>
Cc: Mark Rutland <[email protected]>
Cc: Namhyung Kim <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Link: http://lore.kernel.org/lkml/a38b13f02812a8a6759200b9063c6191337f44d4.1626343282.git.rickyman7@gmail.com
Signed-off-by: Arnaldo Carvalho de Melo <[email protected]>
|
|
The sort dimension "p_stage_cyc" is used to represent pipeline
stage cycle information. Presently, this is used only in powerpc.
For unsupported platforms, we don't want to display it
in the perf report output columns. Hence add check in sort_dimension__add()
and skip the sort key incase it is not applicable for the particular arch.
Signed-off-by: Athira Rajeev <[email protected]>
Reviewed-by: Madhavan Srinivasan <[email protected]>
Acked-by: Jiri Olsa <[email protected]>
Cc: Kajol Jain <[email protected]>
Cc: Kan Liang <[email protected]>
Cc: Michael Ellerman <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Ravi Bangoria <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Arnaldo Carvalho de Melo <[email protected]>
|