Using perf on Linux¶
Introduction¶
perf
is a tool to analyze the performance of applications and of the kernel, on Linux-based systems.
It relies on syscall perf_event_open
(http://man7.org/linux/man-pages/man2/perf_event_open.2.html) to access performance monitoring facilities provided by the kernel.
These facilities consist in:
tracepoints (probes) in the kernel, the C library (glibc), some interpreters, etc.
processor counters from the PMU (Performance Instrumentation Unit), like Intel PMC (Performance Monitoring Counter), replaced by Intel PCM (Processor Counter Monitor), or PIC (Performance Instrumentation Counter)
hardware-assisted tracing, like Intel PT (Processor Tracing)
The access of the performance events system by unprivileged users is configured through sysctl kernel.perf_event_paranoid
(file /proc/sys/kernel/perf_event_paranoid
).
The value of this setting is documented on https://www.kernel.org/doc/Documentation/sysctl/kernel.txt:
-1
: Allow use of (almost) all events by all users. Ignoremlock
limit afterperf_event_mlock_kb
withoutCAP_IPC_LOCK
>= 0
: Disallowftrace
function tracepoint by users withoutCAP_SYS_ADMIN
. Disallow raw tracepoint access by users withoutCAP_SYS_ADMIN
>= 1
: Disallow CPU event access by users withoutCAP_SYS_ADMIN
>= 2
: Disallow kernel profiling by users withoutCAP_SYS_ADMIN
Usage¶
The tool named perf
works with subcommands (stat
, record
, report
…).
# Enumerate all symbolic event types
perf list
# Look for events related to KVM hypervisor
perf list 'kvm:*'
In order to collect several statistics about a command:
perf stat $COMMAND
Example with uname
:
# perf stat uname
Linux
Performance counter stats for 'uname':
0.50 msec task-clock # 0.551 CPUs utilized
0 context-switches # 0.000 K/sec
0 cpu-migrations # 0.000 K/sec
67 page-faults # 0.133 M/sec
1,837,945 cycles # 3.656 GHz
1,266,497 instructions # 0.69 insn per cycle
284,608 branches # 566.071 M/sec
8,956 branch-misses # 3.15% of all branches
0.000911814 seconds time elapsed
0.001001000 seconds user
0.000000000 seconds sys
In order to record a trace of a command:
perf record $COMMAND
# --branch-any: enable taken branch stack sampling
# --call-graph=dwarf: enable call-graph (stack chain/backtrace) recording with DWARF information
perf record --branch-any --call-graph=dwarf $COMMAND
# Record a running process during 30 seconds
# -a = --all-cpus: system-wide collection from all CPUs
# -g (like --call-graph=fp): enable call-graph (stack chain/backtrace) recording
# -p = --pid: record events on existing process ID (comma separated list)
timeout 30s perf record -a -g -p $(pidof $MYPROCESS)
This creates a file named perf.data
, that can be analyzed with other subcommands.
# Show perf.data in an ncurses browser (TUI) if possible
perf report
# Dump the raw trace in ASCII
perf report -D
perf report --dump-raw-trace
# Display the trace output
perf script
# Show perf.data as:
# * a text report
# * with a column for sample count
# * with call stacks
# * with data coalesced and percentages
perf report --stdio -n -g folded
# List fields of header if the record was done with option -a
perf script --header -F comm,pid,tid,cpu,time,event,ip,sym,dso
The trace can also be analyzed with a GUI such as https://github.com/KDAB/hotspot.
When Intel PT (Processor Tracing) is available on the CPU, the following commands can be used to trace a program (from https://lkml.org/lkml/2019/11/27/160):
perf record -e '{intel_pt//,cpu/mem_inst_retired.all_loads,aux-sample-size=8192/pp}:u' $COMMAND
perf script -F +brstackinsn --xed --itrace=i1usl100
More recent versions of perf
introduced an equivalent of strace
without using the ptrace
syscall:
perf trace --call-graph=dwarf $COMMAND
# Or, with perf record:
perf record -e 'raw_syscalls:*' $COMMAND
# Trace with "augmented syscalls" (in order to see string parameters, for example)
perf trace -e /usr/lib/perf/examples/bpf/augmented_raw_syscalls.c $COMMAND
Flame Graphs¶
Using https://github.com/brendangregg/FlameGraph, it is very simple to produce a flamegraph out of a trace. This can be useful for example to find in a program what functions take much time and need to be better optimized.
# Record stack samples at 99 Hertz during 60 seconds
# (both userspace and kernel-space stacks, all processes)
perf record -F 99 -a -g -- sleep 60
# Fold the stacks into a text file
perf script | ./stackcollapse-perf.pl --all > out.folded
# Filter on names of processes, functions... and create a flamegraph
grep my_application < out.folded | ./flamegraph.pl --color=java > graph.svg
Another project enables producing flamegraphs for Rust projects: https://github.com/ferrous-systems/flamegraph
Documentation¶
https://perf.wiki.kernel.org/index.php/Tutorial perf Wiki - Tutorial
http://www.brendangregg.com/perf.html Linux perf Examples, documentation, links, and more!
http://www.brendangregg.com/flamegraphs.html Flame Graphs
https://github.com/brendangregg/perf-tools perf-tools GitHub project
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/tools/perf/Documentation/perf-record.txt perf-record man page
https://alexandrnikitin.github.io/blog/transparent-hugepages-measuring-the-performance-impact/ Transparent Hugepages: measuring the performance impact
https://twitter.com/b0rk/status/945900285460926464 perf cheat sheet by ulia Evans