Diffstat (limited to 'tools/perf/Documentation/perf-amd-ibs.txt')
-rw-r--r-- | tools/perf/Documentation/perf-amd-ibs.txt | 189 |
1 files changed, 189 insertions, 0 deletions
diff --git a/tools/perf/Documentation/perf-amd-ibs.txt b/tools/perf/Documentation/perf-amd-ibs.txt
new file mode 100644
index 000000000000..2fd31d9d7b71
--- /dev/null
+++ b/tools/perf/Documentation/perf-amd-ibs.txt
@@ -0,0 +1,189 @@
+perf-amd-ibs(1)
+===============
+
+NAME
+----
+perf-amd-ibs - Support for AMD Instruction-Based Sampling (IBS) with perf tool
+
+SYNOPSIS
+--------
+[verse]
+'perf record' -e ibs_op//
+'perf record' -e ibs_fetch//
+
+DESCRIPTION
+-----------
+
+Instruction-Based Sampling (IBS) provides precise Instruction Pointer (IP)
+profiling support on AMD platforms. IBS has two independent components: IBS
+Op and IBS Fetch. IBS Op sampling provides information about instruction
+execution (micro-op execution to be precise) with details like d-cache
+hit/miss, d-TLB hit/miss, cache miss latency, load/store data source, branch
+behavior etc. IBS Fetch sampling provides information about instruction fetch
+with details like i-cache hit/miss, i-TLB hit/miss, fetch latency etc. IBS is
+per-smt-thread i.e. each SMT hardware thread contains standalone IBS units.
+
+Both IBS Op and IBS Fetch are exposed as PMUs by Linux and can be used
+with the Linux perf utility. The following files will be created at boot time
+if IBS is supported by the hardware and kernel.
+
+        /sys/bus/event_source/devices/ibs_op/
+        /sys/bus/event_source/devices/ibs_fetch/
+
+IBS Op PMU supports two events: cycles and micro ops. IBS Fetch PMU supports
+one event: fetch ops.
+
+IBS PMUs do not have user/kernel filtering capability and thus require
+CAP_SYS_ADMIN or CAP_PERFMON privilege.
+
+IBS VS. REGULAR CORE PMU
+------------------------
+
+IBS gives samples with precise IP, i.e. the IP recorded with an IBS sample has
+no skid, whereas the IP recorded by the regular core PMU will have some skid
+(sample was generated at IP X but perf would record it at IP X+n). Hence,
+regular core PMU might not help for profiling with instruction level
+precision. Further, IBS provides additional information about the sample in
+question. On the other hand, regular core PMU has its own advantages like
+plethora of events, counting mode (less interference), up to 6 parallel
+counters, event grouping support, filtering capabilities etc.
+
+Three regular core PMU events are internally forwarded to IBS Op PMU when
+precise_ip attribute is set:
+
+        -e cpu-cycles:p becomes -e ibs_op//
+        -e r076:p becomes -e ibs_op//
+        -e r0C1:p becomes -e ibs_op/cnt_ctl=1/
+
+EXAMPLES
+--------
+
+IBS Op PMU
+~~~~~~~~~~
+
+System-wide profile, cycles event, sampling period: 100000
+
+        # perf record -e ibs_op// -c 100000 -a
+
+Per-cpu profile (cpu10), cycles event, sampling period: 100000
+
+        # perf record -e ibs_op// -c 100000 -C 10
+
+Per-cpu profile (cpu10), cycles event, sampling freq: 1000
+
+        # perf record -e ibs_op// -F 1000 -C 10
+
+System-wide profile, uOps event, sampling period: 100000
+
+        # perf record -e ibs_op/cnt_ctl=1/ -c 100000 -a
+
+Same command, but also capture IBS register raw dump along with perf sample:
+
+        # perf record -e ibs_op/cnt_ctl=1/ -c 100000 -a --raw-samples
+
+System-wide profile, uOps event, sampling period: 100000, L3MissOnly (Zen4 onward)
+
+        # perf record -e ibs_op/cnt_ctl=1,l3missonly=1/ -c 100000 -a
+
+Per process (upstream v6.2 onward), uOps event, sampling period: 100000
+
+        # perf record -e ibs_op/cnt_ctl=1/ -c 100000 -p 1234
+
+Per process, launching the command (upstream v6.2 onward), uOps event, sampling period: 100000
+
+        # perf record -e ibs_op/cnt_ctl=1/ -c 100000 -- ls
+
+To analyse recorded profile in aggregate mode
+
+        # perf report
+        /* Select a line and press 'a' to drill down at instruction level. */
+
+To go over each sample
+
+        # perf script
+
+Raw dump of IBS registers when profiled with --raw-samples
+
+        # perf report -D
+        /* Look for PERF_RECORD_SAMPLE */
+
+        Example register raw dump:
+
+        ibs_op_ctl:     000002c30006186a MaxCnt 100000 L3MissOnly 0 En 1
+                Val 1 CntCtl 0=cycles CurCnt 707
+        IbsOpRip:       ffffffff8204aea7
+        ibs_op_data:    0000010002550001 CompToRetCtr 1 TagToRetCtr 597
+                BrnRet 0 RipInvalid 0 BrnFuse 0 Microcode 1
+        ibs_op_data2:   0000000000000013 RmtNode 1 DataSrc 3=DRAM
+        ibs_op_data3:   0000000031960092 LdOp 0 StOp 1 DcL1TlbMiss 0
+                DcL2TlbMiss 0 DcL1TlbHit2M 1 DcL1TlbHit1G 0 DcL2TlbHit2M 0
+                DcMiss 1 DcMisAcc 0 DcWcMemAcc 0 DcUcMemAcc 0 DcLockedOp 0
+                DcMissNoMabAlloc 0 DcLinAddrValid 1 DcPhyAddrValid 1
+                DcL2TlbHit1G 0 L2Miss 1 SwPf 0 OpMemWidth 32 bytes
+                OpDcMissOpenMemReqs 12 DcMissLat 0 TlbRefillLat 0
+        IbsDCLinAd:     ff110008a5398920
+        IbsDCPhysAd:    00000008a5398920
+
+IBS applied in a real-world use case
+
+        ~90% regression was observed in tbench with specific scheduler hint
+        which was counterintuitive. IBS profile of good and bad run captured
+        using perf helped in identifying exact cause of the problem:
+
+        https://lore.kernel.org/r/20220921063638.2489-1-kprateek.nayak@amd.com
+
+IBS Fetch PMU
+~~~~~~~~~~~~~
+
+Similar commands can be used with Fetch PMU as well.
+
+System-wide profile, fetch ops event, sampling period: 100000
+
+        # perf record -e ibs_fetch// -c 100000 -a
+
+System-wide profile, fetch ops event, sampling period: 100000, Random enable
+
+        # perf record -e ibs_fetch/rand_en=1/ -c 100000 -a
+
+        Random enable adds small degree of variability to sample period. This
+        helps in cases like long running loops where PMU is tagging the same
+        instruction over and over because of fixed sample period.
+
+etc.
+
+PERF MEM AND PERF C2C
+---------------------
+
+perf mem is a memory access profiler tool and perf c2c is a shared data
+cacheline analyser tool. Both of them internally use IBS Op PMU on AMD.
+Below is a simple example of the perf mem tool.
+
+        # perf mem record -c 100000 -- make
+        # perf mem report
+
+A normal perf mem report output will provide detailed memory access profile.
+However, it can also be aggregated based on output fields. For example:
+
+        # perf mem report -F mem,sample,snoop
+        Samples: 3M of event 'ibs_op//', Event count (approx.): 23524876
+        Memory access                           Samples  Snoop
+        N/A                                     1903343  N/A
+        L1 hit                                  1056754  N/A
+        L2 hit                                    75231  N/A
+        L3 hit                                     9496  HitM
+        L3 hit                                     2270  N/A
+        RAM hit                                    8710  N/A
+        Remote node, same socket RAM hit           3241  N/A
+        Remote core, same node Any cache hit       1572  HitM
+        Remote core, same node Any cache hit        514  N/A
+        Remote node, same socket Any cache hit     1216  HitM
+        Remote node, same socket Any cache hit      350  N/A
+        Uncached hit                                 18  N/A
+
+Please refer to their man pages for more detail.
+
+SEE ALSO
+--------
+
+linkperf:perf-record[1], linkperf:perf-script[1], linkperf:perf-report[1],
+linkperf:perf-mem[1], linkperf:perf-c2c[1]
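
The example register raw dump in the patch prints `ibs_op_ctl` both as a raw hex value and as decoded fields (MaxCnt, L3MissOnly, En, Val, CntCtl, CurCnt). The following is a minimal sketch of how the two might relate. It is not part of the patch, and the bit positions are an assumption based on the commonly documented layout of the IBS Op control register; check them against the AMD Processor Programming Reference for the target CPU. The function name `decode_ibs_op_ctl` is hypothetical.

```python
def decode_ibs_op_ctl(value: int) -> dict:
    """Split a raw ibs_op_ctl value into the fields shown by `perf report -D`.

    Assumed layout (verify against the AMD PPR for your processor):
      bits 15:0   MaxCnt[19:4]   (period, stored divided by 16)
      bit  16     L3MissOnly
      bit  17     En
      bit  18     Val
      bit  19     CntCtl         (0 = cycles, 1 = micro-ops)
      bits 26:20  MaxCnt[26:20]  (extension of the period)
      bits 58:32  CurCnt
    """
    max_cnt_lo = value & 0xFFFF
    max_cnt_ext = (value >> 20) & 0x7F
    return {
        # The hardware stores the period without its low 4 bits.
        "MaxCnt": ((max_cnt_ext << 16) | max_cnt_lo) << 4,
        "L3MissOnly": (value >> 16) & 1,
        "En": (value >> 17) & 1,
        "Val": (value >> 18) & 1,
        "CntCtl": (value >> 19) & 1,
        "CurCnt": (value >> 32) & 0x7FFFFFF,
    }


# Raw value taken from the example dump in the patch.
fields = decode_ibs_op_ctl(0x000002C30006186A)
for name, val in fields.items():
    print(f"{name:<10} {val}")
```

Under these assumptions, the value from the dump decodes to the same fields perf prints (MaxCnt 100000, En 1, Val 1, CntCtl 0, CurCnt 707), which suggests the sampling period passed with `-c 100000` is what ends up in MaxCnt.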