
Profiling


pprof

pprof helps diagnose which functions use the most CPU resources. Note that this works better by compiling the lib with a test file rather than linking against it.

Run

  • compile with -g and link with -lprofiler
  • run with LD_PRELOAD=/usr/lib/libprofiler.so CPUPROFILE=cpu.profile ./bin (a programmatic alternative is sketched after this list)
  • run with CPUPROFILE_FREQUENCY=xxx LD_PRELOAD=/usr/lib/libprofiler.so CPUPROFILE=cpu.profile ./bin to change the sample rate (100 by default, that is, 100 samples per second). For more information about samples, refer to the Interpretation section below.
  • run google-pprof --text ./bin cpu.profile (switch to --gv, --pdf, --svg, ... depending on the output you prefer)
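
If you prefer to scope profiling to a specific region of the program instead of the whole run, gperftools also exposes a programmatic API in <gperftools/profiler.h>. A minimal sketch (hot_loop is a hypothetical workload standing in for your own code):

    #include <gperftools/profiler.h>  // link with -lprofiler

    // Hypothetical workload standing in for the code to profile.
    static void hot_loop() {
        volatile long sum = 0;
        for (long i = 0; i < 100000000; ++i)
            sum += i % 7;
    }

    int main() {
        ProfilerStart("cpu.profile");  // start writing samples to cpu.profile
        hot_loop();
        ProfilerStop();                // flush and close the profile
        return 0;
    }

The resulting cpu.profile is fed to google-pprof exactly as above; no LD_PRELOAD is needed since the binary links libprofiler directly.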

Output

Interpretation

  • Each box in the graph corresponds to a single function
  • The boxes are sized according to the number of samples in which the function was running
  • An edge from box X to box Y indicates that X calls Y
  • The number along the edge is the number of times that call appears in a sample
  • If a call appears multiple times in a single sample, such as during recursive function calls, each appearance counts toward the edge weight.
  • For each box, the numbers xxx (yyy%) of zzz (ttt%) read as: local (percentage) of cumulative (percentage)

Basically, pprof works by sampling:

  • The program runs normally
  • At regular intervals (e.g., 100 times per second), the program is interrupted and its call stacks are captured (this counts as 1 sample).
  • The report is built from these samples: if there are 100 samples and a function was caught running in 80 of them, then this function will have local=80 and its callers will have their cumulative adjusted by 80 accordingly (reflected by the node edges). The associated percentages are adjusted too (see the sketch after this list).
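
To make the local/cumulative bookkeeping concrete, here is a small illustrative sketch (not pprof's actual implementation) that aggregates fabricated call stacks the same way the report does:

    #include <cstdio>
    #include <map>
    #include <set>
    #include <string>
    #include <vector>

    int main() {
        // Each sample is a call stack, innermost (running) frame last.
        // The stacks below are fabricated for illustration.
        std::vector<std::vector<std::string>> samples = {
            {"main", "compute"},
            {"main", "compute"},
            {"main", "parse"},
        };

        std::map<std::string, int> local, cumulative;
        for (const auto& stack : samples) {
            local[stack.back()]++;                  // only the running frame is "local"
            std::set<std::string> seen(stack.begin(), stack.end());
            for (const auto& fn : seen)             // every frame on the stack counts
                cumulative[fn]++;                   // toward "cumulative", once per sample
        }

        for (const auto& [fn, c] : cumulative)
            std::printf("%s: local=%d cumulative=%d\n", fn.c_str(), local[fn], c);
        // Prints: compute local=2 cumulative=2, main local=0 cumulative=3, parse local=1 cumulative=1
    }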

This means that:

  • The more samples, the more accurate the report, but also the greater the impact on the runtime
  • cumulative helps to understand the caller-callee relationships and where the most expensive calls come from
  • local helps to understand which functions are effectively the most costly

Note that we are talking about CPU usage, that is, the time spent executing the function overall. It is different from the number of times the function is executed (which is not reported by pprof but can be retrieved with callgrind).


cachegrind

cachegrind helps diagnose data & instruction cache misses. Note that this works better by compiling the lib with a test file rather than linking against it.

Run

  • compile with -g
  • run valgrind --tool=cachegrind ./bin
  • run cg_annotate cachegrind.out.xxxx to get a global report by file
  • run cg_annotate cachegrind.out.xxxx /abs/path/to/file.cpp to get a line-by-line report for a specific file
  • run cg_annotate cachegrind.out.xxxx --auto=yes to get a full line-by-line report for all files

Output

Interpretation

  • I: Instruction Cache
  • I1: L1 Instruction Cache
  • LLi: Last Level Instruction Cache (simulated as L2 by valgrind; it stands in for the actual last level, L3/L4/..., on machines with more than 2 cache levels).
  • D: Data cache
  • D1: L1 Data cache
  • LLd: Last Level Data Cache (same as LLi)
  • r: read
  • w: write
  • mr: miss on read
  • mw: miss on write
  • Dmr, D1mr, LLdmr, ... combinations of cache type and action type (e.g., D1mr is a read miss in the L1 Data cache)

Misses are easy to spot, especially on the line-by-line detail.

If one line has a high rate of data misses, it may be worth reorganizing the way the data is accessed to take advantage of cache locality.
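
As a hedged illustration (hypothetical code, not taken from a real report), walking a row-major matrix column-by-column touches a new cache line on almost every access, while walking it row-by-row streams through each cache line; the D1mr column in the line-by-line report makes the difference obvious:

    #include <vector>

    int main() {
        const int N = 2048;
        std::vector<int> m(N * N, 1);  // row-major storage: m[row * N + col]
        long sum = 0;

        // Cache-friendly: consecutive accesses fall in the same cache line.
        for (int row = 0; row < N; ++row)
            for (int col = 0; col < N; ++col)
                sum += m[row * N + col];

        // Cache-hostile: each access jumps N * sizeof(int) bytes ahead,
        // so cachegrind reports far more D1 read misses for this line.
        for (int col = 0; col < N; ++col)
            for (int row = 0; row < N; ++row)
                sum += m[row * N + col];

        return sum == 0;  // use sum so the loops are not optimized away
    }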

It is less clear how to address instruction misses.

callgrind

callgrind helps diagnose which instructions are executed the most. Note that this works better by compiling the lib with a test file rather than linking against it.

Run

  • compile with -g
  • run valgrind --tool=callgrind ./bin
  • run callgrind_annotate callgrind.out.xxxx to get a global report by file
  • run callgrind_annotate callgrind.out.xxxx ./relative/path/to/file.cpp to get a line-by-line report for a specific file
  • run callgrind_annotate callgrind.out.xxxx --auto=yes to get a full line-by-line report for all files
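
When only part of the program is of interest, callgrind also accepts client requests from <valgrind/callgrind.h> to restrict collection. A minimal sketch (workload is a hypothetical function standing in for the code under investigation):

    #include <valgrind/callgrind.h>

    // Hypothetical workload standing in for the code under investigation.
    static long workload() {
        long acc = 0;
        for (int i = 0; i < 1000000; ++i)
            acc += i * i;
        return acc;
    }

    int main() {
        CALLGRIND_TOGGLE_COLLECT;  // start collecting around the workload only
        long r = workload();
        CALLGRIND_TOGGLE_COLLECT;  // stop collecting
        return r == 0;
    }

Run it with valgrind --tool=callgrind --collect-atstart=no ./bin so that nothing outside the toggled region is counted.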

Output

Interpretation

Interpretation is straightforward here: it simply reports how many times a given instruction has been run.

Optimizing highly called areas, or reducing the number of times they are executed, is the right way to address it (if relevant, of course: some instructions are executed very frequently but have a low impact on the overall runtime, so the report should be combined with the pprof one). A hypothetical example of such a reduction is sketched below.
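
For instance (hypothetical code, not from this wiki), hoisting a loop-invariant computation out of a hot loop makes the per-line instruction counts reported by callgrind_annotate collapse:

    #include <cmath>
    #include <vector>

    // Before: std::sqrt runs once per element, so its instructions
    // dominate the per-line counts for the loop body.
    double scale_slow(const std::vector<double>& v, double k) {
        double out = 0;
        for (double x : v)
            out += x * std::sqrt(k);  // loop-invariant call executed v.size() times
        return out;
    }

    // After: the invariant is hoisted, so the expensive instruction
    // sequence is executed exactly once.
    double scale_fast(const std::vector<double>& v, double k) {
        const double s = std::sqrt(k);
        double out = 0;
        for (double x : v)
            out += x * s;
        return out;
    }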

kcachegrind

kcachegrind is an alternative to cg_annotate and callgrind_annotate for navigating through the reports. It provides a graphical interface that can be much more convenient depending on the context.

Run

  • run kcachegrind callgrind.out.xxx or kcachegrind cachegrind.out.xxx

Approach

A top-down approach can be interesting to focus on the relevant areas.

Typically:

  • identify slow extended APIs based on the runtime of the Google Test suites
  • identify slow components with pprof
  • once identified, find the root cause using cachegrind and callgrind