1 files changed, 189 insertions, 0 deletions
diff --git a/tools/perf/Documentation/perf-amd-ibs.txt b/tools/perf/Documentation/perf-amd-ibs.txt
new file mode 100644
index 000000000000..2fd31d9d7b71
--- /dev/null
+++ b/tools/perf/Documentation/perf-amd-ibs.txt
@@ -0,0 +1,189 @@
+perf-amd-ibs(1)
+===============
+
+NAME
+----
+perf-amd-ibs - Support for AMD Instruction-Based Sampling (IBS) with perf tool
+
+SYNOPSIS
+--------
+[verse]
+'perf record' -e ibs_op//
+'perf record' -e ibs_fetch//
+
+DESCRIPTION
+-----------
+
+Instruction-Based Sampling (IBS) provides precise Instruction Pointer (IP)
+profiling support on AMD platforms. IBS has two independent components: IBS
+Op and IBS Fetch. IBS Op sampling provides information about instruction
+execution (micro-op execution to be precise) with details like d-cache
+hit/miss, d-TLB hit/miss, cache miss latency, load/store data source, branch
+behavior etc. IBS Fetch sampling provides information about instruction fetch
+with details like i-cache hit/miss, i-TLB hit/miss, fetch latency etc. IBS is
+per-smt-thread i.e. each SMT hardware thread contains standalone IBS units.
+
+Both, IBS Op and IBS Fetch, are exposed as PMUs by Linux and can be exploited
+using the Linux perf utility. The following files will be created at boot time
+if IBS is supported by the hardware and kernel.
+
+  /sys/bus/event_source/devices/ibs_op/
+  /sys/bus/event_source/devices/ibs_fetch/
+
+IBS Op PMU supports two events: cycles and micro ops. IBS Fetch PMU supports
+one event: fetch ops.
+
+IBS PMUs do not have user/kernel filtering capability and thus it requires
+CAP_SYS_ADMIN or CAP_PERFMON privilege.
+
+IBS VS. REGULAR CORE PMU
+------------------------
+
+IBS gives samples with precise IP, i.e. the IP recorded with IBS sample has
+no skid. Whereas the IP recorded by regular core PMU will have some skid
+(sample was generated at IP X but perf would record it at IP X+n). Hence,
+regular core PMU might not help for profiling with instruction level
+precision. Further, IBS provides additional information about the sample in
+question. On the other hand, regular core PMU has it's own advantages like
+plethora of events, counting mode (less interference), up to 6 parallel
+counters, event grouping support, filtering capabilities etc.
+
+Three regular core PMU events are internally forwarded to IBS Op PMU when
+precise_ip attribute is set:
+
+	-e cpu-cycles:p becomes -e ibs_op//
+	-e r076:p becomes -e ibs_op//
+	-e r0C1:p becomes -e ibs_op/cnt_ctl=1/
+
+EXAMPLES
+--------
+
+IBS Op PMU
+~~~~~~~~~~
+
+System-wide profile, cycles event, sampling period: 100000
+
+	# perf record -e ibs_op// -c 100000 -a
+
+Per-cpu profile (cpu10), cycles event, sampling period: 100000
+
+	# perf record -e ibs_op// -c 100000 -C 10
+
+Per-cpu profile (cpu10), cycles event, sampling freq: 1000
+
+	# perf record -e ibs_op// -F 1000 -C 10
+
+System-wide profile, uOps event, sampling period: 100000
+
+	# perf record -e ibs_op/cnt_ctl=1/ -c 100000 -a
+
+Same command, but also capture IBS register raw dump along with perf sample:
+
+	# perf record -e ibs_op/cnt_ctl=1/ -c 100000 -a --raw-samples
+
+System-wide profile, uOps event, sampling period: 100000, L3MissOnly (Zen4 onward)
+
+	# perf record -e ibs_op/cnt_ctl=1,l3missonly=1/ -c 100000 -a
+
+Per process(upstream v6.2 onward), uOps event, sampling period: 100000
+
+	# perf record -e ibs_op/cnt_ctl=1/ -c 100000 -p 1234
+
+Per process(upstream v6.2 onward), uOps event, sampling period: 100000
+
+	# perf record -e ibs_op/cnt_ctl=1/ -c 100000 -- ls
+
+To analyse recorded profile in aggregate mode
+
+	# perf report
+	/* Select a line and press 'a' to drill down at instruction level. */
+
+To go over each sample
+
+	# perf script
+
+Raw dump of IBS registers when profiled with --raw-samples
+
+	# perf report -D
+	/* Look for PERF_RECORD_SAMPLE */
+
+	Example register raw dump:
+
+	ibs_op_ctl:     000002c30006186a MaxCnt    100000 L3MissOnly 0 En 1
+		Val 1 CntCtl 0=cycles CurCnt       707
+	IbsOpRip:       ffffffff8204aea7
+	ibs_op_data:    0000010002550001 CompToRetCtr     1 TagToRetCtr   597
+		BrnRet 0  RipInvalid 0 BrnFuse 0 Microcode 1
+	ibs_op_data2:   0000000000000013 RmtNode 1 DataSrc 3=DRAM
+	ibs_op_data3:   0000000031960092 LdOp 0 StOp 1 DcL1TlbMiss 0
+		DcL2TlbMiss 0 DcL1TlbHit2M 1 DcL1TlbHit1G 0 DcL2TlbHit2M 0
+		DcMiss 1 DcMisAcc 0 DcWcMemAcc 0 DcUcMemAcc 0 DcLockedOp 0
+		DcMissNoMabAlloc 0 DcLinAddrValid 1 DcPhyAddrValid 1
+		DcL2TlbHit1G 0 L2Miss 1 SwPf 0 OpMemWidth 32 bytes
+		OpDcMissOpenMemReqs 12 DcMissLat     0 TlbRefillLat     0
+	IbsDCLinAd:     ff110008a5398920
+	IbsDCPhysAd:    00000008a5398920
+
+IBS applied in a real world usecase
+
+	~90% regression was observed in tbench with specific scheduler hint
+	which was counter intuitive. IBS profile of good and bad run captured
+	using perf helped in identifying exact cause of the problem:
+
+	https://lore.kernel.org/r/20220921063638.2489-1-kprateek.nayak@amd.com
+
+IBS Fetch PMU
+~~~~~~~~~~~~~
+
+Similar commands can be used with Fetch PMU as well.
+
+System-wide profile, fetch ops event, sampling period: 100000
+
+	# perf record -e ibs_fetch// -c 100000 -a
+
+System-wide profile, fetch ops event, sampling period: 100000, Random enable
+
+	# perf record -e ibs_fetch/rand_en=1/ -c 100000 -a
+
+	Random enable adds small degree of variability to sample period. This
+	helps in cases like long running loops where PMU is tagging the same
+	instruction over and over because of fixed sample period.
+
+etc.
+
+PERF MEM AND PERF C2C
+---------------------
+
+perf mem is a memory access profiler tool and perf c2c is a shared data
+cacheline analyser tool. Both of them internally uses IBS Op PMU on AMD.
+Below is a simple example of the perf mem tool.
+
+	# perf mem record -c 100000 -- make
+	# perf mem report
+
+A normal perf mem report output will provide detailed memory access profile.
+However, it can also be aggregated based on output fields. For example:
+
+	# perf mem report -F mem,sample,snoop
+	Samples: 3M of event 'ibs_op//', Event count (approx.): 23524876
+	Memory access                                 Samples  Snoop
+	N/A                                           1903343  N/A
+	L1 hit                                        1056754  N/A
+	L2 hit                                          75231  N/A
+	L3 hit                                           9496  HitM
+	L3 hit                                           2270  N/A
+	RAM hit                                          8710  N/A
+	Remote node, same socket RAM hit                 3241  N/A
+	Remote core, same node Any cache hit             1572  HitM
+	Remote core, same node Any cache hit              514  N/A
+	Remote node, same socket Any cache hit           1216  HitM
+	Remote node, same socket Any cache hit            350  N/A
+	Uncached hit                                       18  N/A
+
+Please refer to their man page for more detail.
+
+SEE ALSO
+--------
+
+linkperf:perf-record[1], linkperf:perf-script[1], linkperf:perf-report[1],
+linkperf:perf-mem[1], linkperf:perf-c2c[1]