Profiling & Benchmarking

You cannot optimize what you have not measured. Profiling tells you where your program spends time and memory. Benchmarking tells you whether your changes made things better or worse. This topic covers the tools and workflow that make performance work systematic instead of guesswork.

Micro-Benchmarks with Criterion

Criterion is the de facto standard benchmarking library for Rust. It handles warmup, statistical analysis, and comparison across runs.

Add it to your project:

# Cargo.toml
[dev-dependencies]
criterion = { version = "0.5", features = ["html_reports"] }

[[bench]]
name = "my_bench"
harness = false

Write a benchmark:

// benches/my_bench.rs
use criterion::{black_box, criterion_group, criterion_main, Criterion};

fn fibonacci(n: u64) -> u64 {
    match n {
        0 => 0,
        1 => 1,
        _ => fibonacci(n - 1) + fibonacci(n - 2),
    }
}

fn bench_fibonacci(c: &mut Criterion) {
    c.bench_function("fib 20", |b| {
        b.iter(|| fibonacci(black_box(20)))
    });

    c.bench_function("fib 30", |b| {
        b.iter(|| fibonacci(black_box(30)))
    });
}

criterion_group!(benches, bench_fibonacci);
criterion_main!(benches);

Run it:

cargo bench

fib 20                  time:   [25.312 us 25.489 us 25.701 us]
fib 30                  time:   [3.1245 ms 3.1456 ms 3.1678 ms]

Criterion runs your function many times, measures the distribution, and reports a confidence interval. The three numbers in brackets are the lower bound, the point estimate, and the upper bound. It also compares against the previous run:

fib 20                  time:   [24.100 us 24.289 us 24.501 us]
                        change: [-5.1234% -4.7012% -4.2890%] (p = 0.00 < 0.05)
                        Performance has improved.
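
To make the three bracketed numbers less mysterious, here is a minimal sketch of how a lower bound, estimate, and upper bound could be derived from raw timing samples using only the standard library. It assumes a normal approximation for the 95% interval; Criterion itself uses bootstrap resampling, so treat this as an illustration of the idea rather than its actual method. The names collect_samples, summarize, and Summary are hypothetical.

```rust
use std::time::Instant;

/// Lower bound, mean, and upper bound of a set of samples (nanoseconds).
#[derive(Debug, PartialEq)]
struct Summary {
    lower: f64,
    mean: f64,
    upper: f64,
}

/// Runs `f` a few times untimed as warmup, then times it once per sample.
fn collect_samples<F: FnMut()>(mut f: F, warmup: usize, samples: usize) -> Vec<f64> {
    for _ in 0..warmup {
        f();
    }
    (0..samples)
        .map(|_| {
            let start = Instant::now();
            f();
            start.elapsed().as_nanos() as f64
        })
        .collect()
}

/// Mean with an approximate 95% confidence interval.
/// Assumption: normal approximation (mean +/- 1.96 standard errors).
fn summarize(samples: &[f64]) -> Option<Summary> {
    if samples.len() < 2 {
        return None; // a spread cannot be estimated from fewer than two samples
    }
    let n = samples.len() as f64;
    let mean = samples.iter().sum::<f64>() / n;
    let variance = samples.iter().map(|x| (x - mean).powi(2)).sum::<f64>() / (n - 1.0);
    let half_width = 1.96 * (variance / n).sqrt();
    Some(Summary {
        lower: mean - half_width,
        mean,
        upper: mean + half_width,
    })
}
```

A narrow interval means the samples agree with each other; a wide one usually means noise from other processes, frequency scaling, or too few samples.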

black_box

black_box prevents the compiler from optimizing away your benchmark. Without it, the compiler might compute the result at compile time or eliminate dead code:

// BAD: with a constant input, the compiler may precompute the result
b.iter(|| fibonacci(20));

// GOOD: black_box hides the input from the optimizer
b.iter(|| fibonacci(black_box(20)));
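
The same idea applies outside Criterion. The sketch below uses std::hint::black_box from the standard library (available on recent stable Rust) in a hand-written timing loop. The helper time_fibonacci is hypothetical and exists only to show where the two black_box calls go.

```rust
use std::hint::black_box;
use std::time::{Duration, Instant};

fn fibonacci(n: u64) -> u64 {
    match n {
        0 => 0,
        1 => 1,
        _ => fibonacci(n - 1) + fibonacci(n - 2),
    }
}

/// Hypothetical helper: times `iters` calls of `fibonacci(n)` and returns
/// the last result together with the total elapsed time.
fn time_fibonacci(n: u64, iters: u32) -> (u64, Duration) {
    let mut last = 0;
    let start = Instant::now();
    for _ in 0..iters {
        // black_box on the input discourages constant folding of the call;
        // black_box on the output keeps the result from being treated as unused.
        last = black_box(fibonacci(black_box(n)));
    }
    (last, start.elapsed())
}
```

Wrapping both the input and the output is the conservative choice: the input hides the constant, and the output keeps the computation observable.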

Comparing Implementations

Benchmark groups let you compare alternative implementations:

fn bench_sorting(c: &mut Criterion) {
    let mut group = c.benchmark_group("sorting");
    let data: Vec<i32> = (0..1000).rev().collect();

    group.bench_function("std_sort", |b| {
        b.iter_batched(
            || data.clone(),
            // Return the Vec so it is dropped outside the timed routine.
            |mut d| {
                d.sort();
                d
            },
            criterion::BatchSize::SmallInput,
        )
    });

    group.bench_function("std_sort_unstable", |b| {
        b.iter_batched(
            || data.clone(),
            // Return the Vec so it is dropped outside the timed routine.
            |mut d| {
                d.sort_unstable();
                d
            },
            criterion::BatchSize::SmallInput,
        )
    });

    group.finish();
}

iter_batched builds a fresh input for each iteration and keeps that setup out of the measured time, which matters when the function mutates its input.
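
The principle behind iter_batched can be sketched by hand: run the setup with the clock stopped, and time only the routine. The helper time_batched below is hypothetical and deliberately simpler than Criterion, which prepares inputs in batches and times a whole batch at once to reduce timer overhead.

```rust
use std::hint::black_box;
use std::time::{Duration, Instant};

/// Hypothetical helper: calls `setup` untimed before every iteration and
/// accumulates only the time spent inside `routine`.
fn time_batched<I, O>(
    mut setup: impl FnMut() -> I,
    mut routine: impl FnMut(I) -> O,
    iters: u32,
) -> Duration {
    let mut total = Duration::ZERO;
    for _ in 0..iters {
        let input = setup(); // fresh input, not timed
        let start = Instant::now();
        let output = routine(black_box(input));
        total += start.elapsed();
        // The output is dropped after the clock has stopped, so
        // deallocation does not count toward the measurement.
        drop(black_box(output));
    }
    total
}
```

For example, time_batched(|| data.clone(), |mut d| { d.sort(); d }, 100) measures 100 sorts of unsorted input without including the 100 clones.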

CPU Profiling with Flamegraphs

Flamegraphs show where your program spends CPU time as a visual stack trace. Install the tool:

cargo install flamegraph

Run your program with profiling:

cargo flamegraph --bin my_app

On macOS, where the tool relies on dtrace and needs elevated privileges, you may need:

cargo flamegraph --root --bin my_app

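This produces a flamegraph.svg file. Open it in a browser. Each horizontal bar is a stack frame, labeled with its function. Width represents the share of CPU samples that included that frame, so the wider the bar, the more time spent there. The left-to-right order is not chronological.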

Reading a Flamegraph

Look for wide plateaus at the top of the graph: these are hot functions. The stack grows upward, so the bottom is main and the top of each tower is where CPU time is actually spent.

Common things you will find:

A wide bar on memcpy suggests you are copying large amounts of data, possibly unnecessarily.
A wide bar on malloc or free means allocation is a bottleneck.
A wide bar on serde_json::de means JSON deserialization dominates.
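
It can help to see what a flamegraph is built from. Profilers record call stacks at regular intervals, and flamegraph tools collapse identical stacks into "folded" lines such as main;parse;memcpy followed by a sample count. The sketch below reproduces that aggregation on made-up sample data; the function names and the input representation are assumptions for illustration only.

```rust
use std::collections::BTreeMap;

/// Collapses sampled stacks (outermost frame first) into folded form,
/// mapping "main;parse;memcpy" to the number of samples with that exact stack.
fn fold_stacks(samples: &[Vec<&str>]) -> BTreeMap<String, u64> {
    let mut folded = BTreeMap::new();
    for stack in samples {
        *folded.entry(stack.join(";")).or_insert(0) += 1;
    }
    folded
}

/// Share of samples in which `function` appears anywhere on the stack.
/// This corresponds to the combined width of its bars.
fn inclusive_share(samples: &[Vec<&str>], function: &str) -> f64 {
    if samples.is_empty() {
        return 0.0;
    }
    let hits = samples.iter().filter(|s| s.contains(&function)).count();
    hits as f64 / samples.len() as f64
}

/// Share of samples in which `function` is the innermost frame, meaning
/// the CPU was executing that function's own code (a plateau at the top).
fn self_share(samples: &[Vec<&str>], function: &str) -> f64 {
    if samples.is_empty() {
        return 0.0;
    }
    let hits = samples
        .iter()
        .filter(|s| s.last() == Some(&function))
        .count();
    hits as f64 / samples.len() as f64
}
```

A function with a large inclusive share but a small self share is wide only because of what it calls, so the fix usually lies further up the tower.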

Linux perf

On Linux, perf provides lower-level profiling:

perf record --call-graph dwarf ./target/release/my_app
perf report

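perf is itself a sampling profiler: it periodically records the call stack, optionally triggered by hardware performance counters, which keeps its overhead low. On Linux, cargo flamegraph uses perf under the hood, so perf report shows the same data as a flamegraph, but in an interactive TUI.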

Memory Profiling with DHAT

DHAT (Dynamic Heap Analysis Tool) tracks every heap allocation. It tells you how many bytes were allocated, how often, and from where.

Add the crate:

# Cargo.toml
[dependencies]
dhat = "0.3"

# Only enable when profiling
[features]
dhat-heap = []

Instrument your code:

#[cfg(feature = "dhat-heap")]
#[global_allocator]
static ALLOC: dhat::Alloc = dhat::Alloc;

fn main() {
    #[cfg(feature = "dhat-heap")]
    let _profiler = dhat::Profiler::new_heap();

    // Your application code here
    run_app();
}

Run with the feature enabled:

cargo run --release --features dhat-heap

dhat: Total:     1,234,567 bytes in 5,678 blocks
dhat: At t-gmax: 456,789 bytes in 1,234 blocks
dhat: At t-end:  0 bytes in 0 blocks

When the profiler is dropped, DHAT writes a JSON file (dhat-heap.json by default) that you can load in the DHAT viewer (dh_view.html). It shows each allocation site with byte counts, helping you find code that allocates more than it should.
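
To illustrate the mechanism, here is a minimal sketch of a counting global allocator built on the standard library alone. It is not how the dhat crate is implemented in detail (DHAT also records a backtrace per allocation site, peak usage, and lifetimes); it only shows how replacing the global allocator lets a tool observe every allocation. CountingAlloc and measure_allocations are hypothetical names.

```rust
use std::alloc::{GlobalAlloc, Layout, System};
use std::sync::atomic::{AtomicUsize, Ordering};

static TOTAL_BYTES: AtomicUsize = AtomicUsize::new(0);
static TOTAL_BLOCKS: AtomicUsize = AtomicUsize::new(0);

/// Hypothetical allocator that counts requests and forwards them to the
/// system allocator.
struct CountingAlloc;

unsafe impl GlobalAlloc for CountingAlloc {
    unsafe fn alloc(&self, layout: Layout) -> *mut u8 {
        TOTAL_BYTES.fetch_add(layout.size(), Ordering::Relaxed);
        TOTAL_BLOCKS.fetch_add(1, Ordering::Relaxed);
        unsafe { System.alloc(layout) }
    }

    unsafe fn dealloc(&self, ptr: *mut u8, layout: Layout) {
        unsafe { System.dealloc(ptr, layout) }
    }
}

#[global_allocator]
static ALLOC: CountingAlloc = CountingAlloc;

/// Returns the value produced by `f` plus the bytes and blocks allocated
/// process-wide while it ran (other threads are counted too).
fn measure_allocations<T>(f: impl FnOnce() -> T) -> (T, usize, usize) {
    let bytes_before = TOTAL_BYTES.load(Ordering::Relaxed);
    let blocks_before = TOTAL_BLOCKS.load(Ordering::Relaxed);
    let value = f();
    let bytes = TOTAL_BYTES.load(Ordering::Relaxed) - bytes_before;
    let blocks = TOTAL_BLOCKS.load(Ordering::Relaxed) - blocks_before;
    (value, bytes, blocks)
}
```

Wrapping a suspicious function call in measure_allocations gives a quick first answer before reaching for the full DHAT report.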

The Profiling Workflow

Effective performance work follows a cycle:

Step 1: Measure the Baseline

Before changing anything, establish numbers:

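use std::hint::black_box;
use std::time::Instant;

// load_data and process_data stand in for your application's functions.
fn main() {
    // Load outside the timed region so only processing is measured.
    let data = load_data();

    let start = Instant::now();
    let result = process_data(&data);
    let elapsed = start.elapsed();

    // Keep the result observable so the work is not optimized away.
    black_box(result);
    println!("Processing took {:?}", elapsed);
}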

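For benchmarks, run cargo bench. Criterion stores the results under target/criterion and compares the next run against them. To keep a named reference point, pass -- --save-baseline followed by a name.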
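
A single Instant measurement is noisy, so a baseline is more trustworthy when it comes from several runs. The sketch below repeats a closure and reports the median, which is less sensitive to the occasional slow outlier than the mean. The helpers time_runs and median are hypothetical.

```rust
use std::hint::black_box;
use std::time::{Duration, Instant};

/// Hypothetical helper: runs `work` `runs` times and returns the elapsed
/// times in ascending order.
fn time_runs<T>(mut work: impl FnMut() -> T, runs: usize) -> Vec<Duration> {
    let mut times: Vec<Duration> = (0..runs)
        .map(|_| {
            let start = Instant::now();
            black_box(work()); // keep the result observable
            start.elapsed()
        })
        .collect();
    times.sort();
    times
}

/// Median of an already sorted slice, or None if it is empty.
fn median(sorted: &[Duration]) -> Option<Duration> {
    if sorted.is_empty() {
        return None;
    }
    let mid = sorted.len() / 2;
    if sorted.len() % 2 == 1 {
        Some(sorted[mid])
    } else {
        // Even count: average the two middle values.
        Some((sorted[mid - 1] + sorted[mid]) / 2)
    }
}
```

Record the median together with the number of runs and the input size, so that the later comparison is made under the same conditions.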

Step 2: Identify the Hot Path

Run a flamegraph or profiler. Do not guess where time is spent. Developers are notoriously bad at predicting bottlenecks.

cargo flamegraph --bin my_app -- --input large_dataset.csv

Step 3: Optimize the Bottleneck

Once you know where the time goes, make targeted changes. Common wins:

Replace Vec<String> with Vec<&str> to avoid a heap allocation per string
Use HashMap::with_capacity to avoid repeated resizing and rehashing as the map grows
Replace clone() with borrowing where ownership is not needed
Switch from serde_json to simd-json for parsing
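
The first three items can be sketched on a small word-counting example. The functions below are hypothetical and only illustrate the shape of each change; whether any of them helps in your program is something the profiler has to confirm.

```rust
use std::collections::HashMap;

/// Allocating version: every word becomes its own heap-allocated String.
fn words_owned(text: &str) -> Vec<String> {
    text.split_whitespace().map(|w| w.to_string()).collect()
}

/// Borrowing version: the slices point into `text`, so only the Vec's
/// own buffer is allocated.
fn words_borrowed(text: &str) -> Vec<&str> {
    text.split_whitespace().collect()
}

/// Counts words using borrowed keys and a map sized up front.
fn count_words<'a>(words: &[&'a str]) -> HashMap<&'a str, usize> {
    // words.len() is an upper bound on the number of distinct keys.
    // Assumption: over-reserving is acceptable for this workload.
    let mut counts = HashMap::with_capacity(words.len());
    for &word in words {
        *counts.entry(word).or_insert(0) += 1;
    }
    counts
}
```

The borrowed versions tie the result's lifetime to the input text, which is the usual trade-off: fewer allocations in exchange for keeping the source alive.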

Step 4: Measure Again

Run the same benchmark. Compare:

cargo bench

processing              time:   [45.312 ms 45.789 ms 46.201 ms]
                        change: [-23.456% -22.890% -22.345%] (p = 0.00 < 0.05)
                        Performance has improved.

If the numbers did not improve, your hypothesis was wrong. Go back to Step 2.
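
One simple way to decide whether the numbers moved is to compare confidence intervals rather than single values. The sketch below uses a deliberately conservative rule: it reports a change only when the two intervals do not overlap. Criterion's own verdict is based on a significance test and a noise threshold, so this is a simplification, and the Interval and Verdict types are hypothetical.

```rust
/// A benchmark result as lower bound, point estimate, and upper bound,
/// all in the same time unit.
#[derive(Debug, Clone, Copy)]
struct Interval {
    lower: f64,
    estimate: f64,
    upper: f64,
}

#[derive(Debug, PartialEq)]
enum Verdict {
    Improved,
    Regressed,
    NoClearChange,
}

/// Relative change of the point estimate in percent (negative means faster).
fn percent_change(before: Interval, after: Interval) -> f64 {
    (after.estimate - before.estimate) / before.estimate * 100.0
}

/// Simplified rule: only non-overlapping intervals count as a real change.
fn classify(before: Interval, after: Interval) -> Verdict {
    if after.upper < before.lower {
        Verdict::Improved
    } else if after.lower > before.upper {
        Verdict::Regressed
    } else {
        Verdict::NoClearChange
    }
}
```

Applied to the two fib 20 intervals shown earlier, the after interval ends at 24.501 us, below the 25.312 us where the before interval starts, so the rule reports an improvement, and percent_change gives roughly -4.7%.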

Step 5: Repeat

Performance work is iterative. After fixing one bottleneck, the next largest one becomes visible. Stop when the performance meets your requirements.

Profiling Async Code

Async profiling has a caveat: the executor interleaves tasks, making flamegraphs harder to read. Use tokio-console for runtime-level insight:

cargo install tokio-console

# Cargo.toml
[dependencies]
console-subscriber = "0.2"
tokio = { version = "1", features = ["full", "tracing"] }

#[tokio::main]
async fn main() {
    // Start the console subscriber before spawning any tasks.
    console_subscriber::init();
    // Your tokio application
}

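Build and run your application with RUSTFLAGS="--cfg tokio_unstable", which tokio's tracing feature requires, then start the console in a second terminal:

tokio-console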

This shows live task states, poll times, and waker statistics. It helps identify tasks that are slow to poll or are being starved.
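
To make "slow to poll" concrete, the sketch below wraps a future and records how many times it is polled and how long those polls take in total. It uses only the standard library with a tiny busy-polling executor, so it is an illustration of the metric, not of how tokio-console collects it (that goes through tokio's tracing instrumentation). Timed, YieldOnce, NoopWake, and block_on are all hypothetical.

```rust
use std::future::Future;
use std::pin::Pin;
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};
use std::time::{Duration, Instant};

/// Wraps a future and records poll count and total time spent in poll.
struct Timed<F> {
    inner: Pin<Box<F>>,
    polls: u32,
    busy: Duration,
}

impl<F: Future> Timed<F> {
    fn new(inner: F) -> Self {
        Timed { inner: Box::pin(inner), polls: 0, busy: Duration::ZERO }
    }
}

impl<F: Future> Future for Timed<F> {
    type Output = (F::Output, u32, Duration);

    fn poll(mut self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<Self::Output> {
        let start = Instant::now();
        let result = self.inner.as_mut().poll(cx);
        self.busy += start.elapsed();
        self.polls += 1;
        match result {
            Poll::Ready(value) => Poll::Ready((value, self.polls, self.busy)),
            Poll::Pending => Poll::Pending,
        }
    }
}

/// Example future that returns Pending once, then completes with 42.
struct YieldOnce {
    yielded: bool,
}

impl Future for YieldOnce {
    type Output = u32;

    fn poll(mut self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<u32> {
        if self.yielded {
            Poll::Ready(42)
        } else {
            self.yielded = true;
            cx.waker().wake_by_ref(); // request another poll
            Poll::Pending
        }
    }
}

/// Waker that does nothing; enough for an executor that polls in a loop.
struct NoopWake;

impl Wake for NoopWake {
    fn wake(self: Arc<Self>) {}
}

/// Minimal busy-polling executor, for illustration only.
fn block_on<F: Future>(future: F) -> F::Output {
    let waker = Waker::from(Arc::new(NoopWake));
    let mut cx = Context::from_waker(&waker);
    let mut future = Box::pin(future);
    loop {
        if let Poll::Ready(value) = future.as_mut().poll(&mut cx) {
            return value;
        }
    }
}
```

A task whose total busy time is large relative to its poll count is doing too much work between await points, which is the kind of task that starves its neighbours on a shared executor.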

Common Pitfalls

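Benchmarking in debug mode. cargo bench builds with optimizations by default, but cargo run and cargo build do not, so pass --release when you time or profile a binary. Debug builds skip most optimizations, including inlining, so their numbers say little about real performance.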
Optimizing without profiling. Guessing the bottleneck and optimizing the wrong code wastes time and adds complexity. Always measure first.
Micro-benchmarking unrealistic workloads. A benchmark that processes 10 elements may not reflect behavior at 10 million. Benchmark with realistic data sizes.
Ignoring allocation. A CPU profile shows time spent inside malloc and free, but not how many bytes each call site allocates or how long they stay live. Run DHAT separately if you suspect allocation is a problem.
Not using black_box. Without it, the compiler may constant-fold or eliminate the work you meant to measure, producing misleadingly fast results.
Optimizing once and forgetting. Performance regresses over time as features are added. Run benchmarks in CI to catch regressions early.

Key Takeaways

Use Criterion for micro-benchmarks. It handles statistical analysis and cross-run comparison.
Use flamegraphs to identify hot paths visually. Wide bars are where optimization pays off.
Use DHAT for memory profiling. Allocation is often a hidden bottleneck.
Follow the workflow: measure, identify, optimize, measure again. Never skip steps.
Always benchmark in release mode with realistic data.
Profile async code with tokio-console for runtime-level task visibility.