Using perf on Linux

Introduction

perf is a tool to analyze the performance of applications and of the kernel, on Linux-based systems. It relies on syscall perf_event_open (http://man7.org/linux/man-pages/man2/perf_event_open.2.html) to access performance monitoring facilities provided by the kernel. These facilities consist in:

  • tracepoints (probes) in the kernel, the C library (glibc), some interpreters, etc.

  • processor counters from the PMU (Performance Instrumentation Unit), like Intel PMC (Performance Monitoring Counter), replaced by Intel PCM (Processor Counter Monitor), or PIC (Performance Instrumentation Counter)

  • hardware-assisted tracing, like Intel PT (Processor Tracing)

The access of the performance events system by unprivileged users is configured through sysctl kernel.perf_event_paranoid (file /proc/sys/kernel/perf_event_paranoid). The value of this setting is documented on https://www.kernel.org/doc/Documentation/sysctl/kernel.txt:

  • -1: Allow use of (almost) all events by all users. Ignore mlock limit after perf_event_mlock_kb without CAP_IPC_LOCK

  • >= 0: Disallow ftrace function tracepoint by users without CAP_SYS_ADMIN. Disallow raw tracepoint access by users without CAP_SYS_ADMIN

  • >= 1: Disallow CPU event access by users without CAP_SYS_ADMIN

  • >= 2: Disallow kernel profiling by users without CAP_SYS_ADMIN

Usage

The tool named perf works with subcommands (stat, record, report…).

# Enumerate all symbolic event types
perf list

# Look for events related to KVM hypervisor
perf list 'kvm:*'

In order to collect several statistics about a command:

perf stat $COMMAND

Example with uname:

# perf stat uname
Linux

 Performance counter stats for 'uname':

              0.50 msec task-clock                #    0.551 CPUs utilized
                 0      context-switches          #    0.000 K/sec
                 0      cpu-migrations            #    0.000 K/sec
                67      page-faults               #    0.133 M/sec
         1,837,945      cycles                    #    3.656 GHz
         1,266,497      instructions              #    0.69  insn per cycle
           284,608      branches                  #  566.071 M/sec
             8,956      branch-misses             #    3.15% of all branches

       0.000911814 seconds time elapsed

       0.001001000 seconds user
       0.000000000 seconds sys

In order to record a trace of a command:

perf record $COMMAND

# --branch-any: enable taken branch stack sampling
# --call-graph=dwarf: enable call-graph (stack chain/backtrace) recording with DWARF information
perf record --branch-any --call-graph=dwarf $COMMAND

# Record a running process during 30 seconds
# -a = --all-cpus: system-wide collection from all CPUs
# -g (like --call-graph=fp): enable call-graph (stack chain/backtrace) recording
# -p = --pid: record events on existing process ID (comma separated list)
timeout 30s perf record -a -g -p $(pidof $MYPROCESS)

This creates a file named perf.data, that can be analyzed with other subcommands.

# Show perf.data in an ncurses browser (TUI) if possible
perf report

# Dump the raw trace in ASCII
perf report -D
perf report --dump-raw-trace

# Display the trace output
perf script

# Show perf.data as:
# * a text report
# * with a column for sample count
# * with call stacks
# * with data coalesced and percentages
perf report --stdio -n -g folded

# List fields of header if the record was done with option -a
perf script --header -F comm,pid,tid,cpu,time,event,ip,sym,dso

The trace can also be analyzed with a GUI such as https://github.com/KDAB/hotspot.

When Intel PT (Processor Tracing) is available on the CPU, the following commands can be used to trace a program (from https://lkml.org/lkml/2019/11/27/160):

perf record -e '{intel_pt//,cpu/mem_inst_retired.all_loads,aux-sample-size=8192/pp}:u' $COMMAND
perf script -F +brstackinsn --xed --itrace=i1usl100

More recent versions of perf introduced an equivalent of strace without using the ptrace syscall:

perf trace --call-graph=dwarf $COMMAND

# Or, with perf record:
perf record -e 'raw_syscalls:*' $COMMAND

# Trace with "augmented syscalls" (in order to see string parameters, for example)
perf trace -e /usr/lib/perf/examples/bpf/augmented_raw_syscalls.c $COMMAND

Flame Graphs

Using https://github.com/brendangregg/FlameGraph, it is very simple to produce a flamegraph out of a trace. This can be useful for example to find in a program what functions take much time and need to be better optimized.

# Record stack samples at 99 Hertz during 60 seconds
# (both userspace and kernel-space stacks, all processes)
perf record -F 99 -a -g -- sleep 60

# Fold the stacks into a text file
perf script | ./stackcollapse-perf.pl --all > out.folded

# Filter on names of processes, functions... and create a flamegraph
grep my_application < out.folded | ./flamegraph.pl --color=java > graph.svg

Another project enables producing flamegraphs for Rust projects: https://github.com/ferrous-systems/flamegraph