Diffstat (limited to 'tools/perf/Documentation/perf-amd-ibs.txt')
-rw-r--r-- tools/perf/Documentation/perf-amd-ibs.txt | 189
1 file changed, 189 insertions, 0 deletions
diff --git a/tools/perf/Documentation/perf-amd-ibs.txt b/tools/perf/Documentation/perf-amd-ibs.txt
new file mode 100644
index 000000000000..2fd31d9d7b71
--- /dev/null
+++ b/tools/perf/Documentation/perf-amd-ibs.txt
@@ -0,0 +1,189 @@
+perf-amd-ibs(1)
+===============
+
+NAME
+----
+perf-amd-ibs - Support for AMD Instruction-Based Sampling (IBS) with the perf tool
+
+SYNOPSIS
+--------
+[verse]
+'perf record' -e ibs_op//
+'perf record' -e ibs_fetch//
+
+DESCRIPTION
+-----------
+
+Instruction-Based Sampling (IBS) provides precise Instruction Pointer (IP)
+profiling support on AMD platforms. IBS has two independent components: IBS
+Op and IBS Fetch. IBS Op sampling provides information about instruction
+execution (micro-op execution to be precise) with details like d-cache
+hit/miss, d-TLB hit/miss, cache miss latency, load/store data source, branch
+behavior etc. IBS Fetch sampling provides information about instruction fetch
+with details like i-cache hit/miss, i-TLB hit/miss, fetch latency etc. IBS is
+per SMT thread, i.e. each SMT hardware thread has its own IBS units.
+
+Both IBS Op and IBS Fetch are exposed as PMUs by Linux and can be used via
+the Linux perf utility. The following directories are created at boot time
+if IBS is supported by both the hardware and the kernel:
+
+ /sys/bus/event_source/devices/ibs_op/
+ /sys/bus/event_source/devices/ibs_fetch/
+
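A minimal shell sketch of that check, assuming only the two sysfs directories listed above. The function name `ibs_pmus_present` and its optional directory argument are illustrative and not part of perf:

```shell
# Hypothetical helper: succeed only if both IBS PMUs are registered.
# $1 optionally overrides the event_source devices directory, which makes
# the check easy to exercise against a scratch directory.
ibs_pmus_present() {
    dev_root="${1:-/sys/bus/event_source/devices}"
    [ -d "$dev_root/ibs_op" ] && [ -d "$dev_root/ibs_fetch" ]
}

# Possible use before recording:
#   ibs_pmus_present || echo "IBS PMUs not available" >&2
```

The check says nothing about privileges; it only reports whether the kernel registered the two PMUs.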
+IBS Op PMU supports two events: cycles and micro ops. IBS Fetch PMU supports
+one event: fetch ops.
+
+IBS PMUs cannot filter samples by user/kernel mode, so using them requires
+the CAP_SYS_ADMIN or CAP_PERFMON privilege.
+
+IBS VS. REGULAR CORE PMU
+------------------------
+
+IBS gives samples with a precise IP, i.e. the IP recorded with an IBS sample
+has no skid, whereas the IP recorded by the regular core PMU has some skid
+(the sample was generated at IP X but perf records it at IP X+n). Hence,
+the regular core PMU might not help for profiling with instruction-level
+precision. Further, IBS provides additional information about the sample in
+question. On the other hand, the regular core PMU has its own advantages,
+such as a plethora of events, counting mode (less interference), up to 6
+parallel counters, event grouping support, filtering capabilities etc.
+
+Three regular core PMU events are internally forwarded to the IBS Op PMU when
+the precise_ip attribute is set:
+
+ -e cpu-cycles:p becomes -e ibs_op//
+ -e r076:p becomes -e ibs_op//
+ -e r0C1:p becomes -e ibs_op/cnt_ctl=1/
+
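The forwarding table above can be written as a small lookup. This is only an illustration of the mapping: the function name `ibs_op_equivalent` is hypothetical, only the `:p` spellings shown above are modelled, and the actual forwarding is done internally rather than by any script:

```shell
# Illustrative lookup of the three forwarded events listed above.
# Prints the equivalent IBS Op event, or fails for anything else.
ibs_op_equivalent() {
    case "$1" in
        cpu-cycles:p|r076:p) echo 'ibs_op//' ;;
        r0C1:p)              echo 'ibs_op/cnt_ctl=1/' ;;
        *)                   return 1 ;;   # not one of the forwarded events
    esac
}

ibs_op_equivalent r0C1:p
```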
+EXAMPLES
+--------
+
+IBS Op PMU
+~~~~~~~~~~
+
+System-wide profile, cycles event, sampling period: 100000
+
+ # perf record -e ibs_op// -c 100000 -a
+
+Per-cpu profile (cpu10), cycles event, sampling period: 100000
+
+ # perf record -e ibs_op// -c 100000 -C 10
+
+Per-cpu profile (cpu10), cycles event, sampling freq: 1000
+
+ # perf record -e ibs_op// -F 1000 -C 10
+
+System-wide profile, uOps event, sampling period: 100000
+
+ # perf record -e ibs_op/cnt_ctl=1/ -c 100000 -a
+
+Same command, but also capture a raw dump of the IBS registers with each perf sample:
+
+ # perf record -e ibs_op/cnt_ctl=1/ -c 100000 -a --raw-samples
+
+System-wide profile, uOps event, sampling period: 100000, L3MissOnly (Zen4 onward)
+
+ # perf record -e ibs_op/cnt_ctl=1,l3missonly=1/ -c 100000 -a
+
+Per-process profile of a running process (pid 1234; upstream v6.2 onward), uOps event, sampling period: 100000
+
+ # perf record -e ibs_op/cnt_ctl=1/ -c 100000 -p 1234
+
+Per-process profile of a newly launched command (ls; upstream v6.2 onward), uOps event, sampling period: 100000
+
+ # perf record -e ibs_op/cnt_ctl=1/ -c 100000 -- ls
+
+To analyse the recorded profile in aggregate mode
+
+ # perf report
+ /* Select a line and press 'a' to drill down at instruction level. */
+
+To go over each sample
+
+ # perf script
+
+Raw dump of IBS registers when profiled with --raw-samples
+
+ # perf report -D
+ /* Look for PERF_RECORD_SAMPLE */
+
+ Example register raw dump:
+
+ ibs_op_ctl: 000002c30006186a MaxCnt 100000 L3MissOnly 0 En 1
+ Val 1 CntCtl 0=cycles CurCnt 707
+ IbsOpRip: ffffffff8204aea7
+ ibs_op_data: 0000010002550001 CompToRetCtr 1 TagToRetCtr 597
+ BrnRet 0 RipInvalid 0 BrnFuse 0 Microcode 1
+ ibs_op_data2: 0000000000000013 RmtNode 1 DataSrc 3=DRAM
+ ibs_op_data3: 0000000031960092 LdOp 0 StOp 1 DcL1TlbMiss 0
+ DcL2TlbMiss 0 DcL1TlbHit2M 1 DcL1TlbHit1G 0 DcL2TlbHit2M 0
+ DcMiss 1 DcMisAcc 0 DcWcMemAcc 0 DcUcMemAcc 0 DcLockedOp 0
+ DcMissNoMabAlloc 0 DcLinAddrValid 1 DcPhyAddrValid 1
+ DcL2TlbHit1G 0 L2Miss 1 SwPf 0 OpMemWidth 32 bytes
+ OpDcMissOpenMemReqs 12 DcMissLat 0 TlbRefillLat 0
+ IbsDCLinAd: ff110008a5398920
+ IbsDCPhysAd: 00000008a5398920
+
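To see how the decoded fields relate to the raw ibs_op_ctl value in the dump above, here is a minimal sketch using shell arithmetic. The bit positions are an assumption taken from AMD's published IBS register layout, not from this document, and are checked here only against the example value. The function name is illustrative:

```shell
# Decode a few ibs_op_ctl fields from the raw 64-bit register value.
# Assumed layout: MaxCnt[19:4] in bits 15:0, L3MissOnly bit 16, En bit 17,
# Val bit 18, CntCtl bit 19, MaxCnt[26:20] in bits 26:20, CurCnt in bits 58:32.
decode_ibs_op_ctl() {
    v=$(( $1 ))
    max_lo=$(( v & 0xffff ))            # MaxCnt[19:4]
    max_hi=$(( (v >> 20) & 0x7f ))      # MaxCnt[26:20]
    echo "MaxCnt $(( ((max_hi << 16) | max_lo) << 4 ))" \
         "L3MissOnly $(( (v >> 16) & 1 ))" \
         "En $(( (v >> 17) & 1 ))" \
         "Val $(( (v >> 18) & 1 ))" \
         "CntCtl $(( (v >> 19) & 1 ))" \
         "CurCnt $(( (v >> 32) & 0x7ffffff ))"
}

decode_ibs_op_ctl 0x000002c30006186a
# MaxCnt 100000 L3MissOnly 0 En 1 Val 1 CntCtl 0 CurCnt 707
```

The low 16 bits hold the sampling period divided by 16, which is why 0x186a (6250) is reported as MaxCnt 100000.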
+IBS applied to a real-world use case
+
+ A ~90% regression was observed in tbench with a specific scheduler hint,
+ which was counterintuitive. IBS profiles of the good and bad runs, captured
+ using perf, helped identify the exact cause of the problem:
+
+ https://lore.kernel.org/r/20220921063638.2489-1-kprateek.nayak@amd.com
+
+IBS Fetch PMU
+~~~~~~~~~~~~~
+
+Similar commands can be used with Fetch PMU as well.
+
+System-wide profile, fetch ops event, sampling period: 100000
+
+ # perf record -e ibs_fetch// -c 100000 -a
+
+System-wide profile, fetch ops event, sampling period: 100000, Random enable
+
+ # perf record -e ibs_fetch/rand_en=1/ -c 100000 -a
+
+ Random enable adds a small degree of variability to the sample period.
+ This helps in cases like long-running loops, where the PMU keeps tagging
+ the same instruction over and over because of the fixed sample period.
+
+etc.
+
+PERF MEM AND PERF C2C
+---------------------
+
+perf mem is a memory access profiler tool and perf c2c is a shared data
+cacheline analyser tool. Both internally use the IBS Op PMU on AMD.
+Below is a simple example of the perf mem tool.
+
+ # perf mem record -c 100000 -- make
+ # perf mem report
+
+A normal perf mem report provides a detailed memory access profile.
+However, it can also be aggregated based on output fields. For example:
+
+ # perf mem report -F mem,sample,snoop
+ Samples: 3M of event 'ibs_op//', Event count (approx.): 23524876
+ Memory access                           Samples  Snoop
+ N/A                                     1903343  N/A
+ L1 hit                                  1056754  N/A
+ L2 hit                                    75231  N/A
+ L3 hit                                     9496  HitM
+ L3 hit                                     2270  N/A
+ RAM hit                                    8710  N/A
+ Remote node, same socket RAM hit           3241  N/A
+ Remote core, same node Any cache hit       1572  HitM
+ Remote core, same node Any cache hit        514  N/A
+ Remote node, same socket Any cache hit     1216  HitM
+ Remote node, same socket Any cache hit      350  N/A
+ Uncached hit                                 18  N/A
+
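An aggregated report like the one above is easy to post-process. The following is a minimal sketch, assuming the report is available as plain text on standard input and that the Samples and Snoop values are the last two whitespace-separated fields of each row. The function name `hitm_samples` is illustrative:

```shell
# Sum the Samples column over all rows whose Snoop column is HitM.
# Header and summary lines are skipped because their last field is not "HitM".
hitm_samples() {
    awk '$NF == "HitM" { sum += $(NF - 1) } END { print sum + 0 }'
}
```

For the report shown above, the three HitM rows add up to 9496 + 1572 + 1216 = 12284 samples.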
+Please refer to their man pages for more details.
+
+SEE ALSO
+--------
+
+linkperf:perf-record[1], linkperf:perf-script[1], linkperf:perf-report[1],
+linkperf:perf-mem[1], linkperf:perf-c2c[1]