Zero-Cost Abstractions

The phrase "zero-cost abstractions" is Rust's core performance promise: you do not pay for what you do not use, and what you do use, you could not hand-code any better. This is not just marketing. With optimizations enabled, the compiler usually lowers high-level Rust to machine code comparable to what you would get from hand-written C. Understanding where this holds, and where it does not, is essential for writing fast Rust.

Iterators Compile to Loops

Rust's iterator chains look high-level. In optimized builds they typically compile to the same tight loops you would write by hand.

fn sum_of_squares(data: &[i64]) -> i64 {
    data.iter()
        .filter(|&&x| x > 0)
        .map(|&x| x * x)
        .sum()
}

The equivalent hand-written loop:

fn sum_of_squares_manual(data: &[i64]) -> i64 {
    let mut total = 0;
    for &x in data {
        if x > 0 {
            total += x * x;
        }
    }
    total
}

In release mode these two functions typically compile to the same, or nearly the same, machine code. The iterator version creates no intermediate collections, puts no closures on the heap, and uses no function pointer indirection: each adapter is a small struct, and the optimizer inlines the closure calls.

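You can check this yourself with a rough timing comparison (build with --release). The values are kept below 1000 so that the sum of squares stays well within the range of i64:

use std::time::Instant;

fn main() {
    // 10 million elements cycling through 0..1000; small values avoid i64 overflow.
    let data: Vec<i64> = (0..10_000_000).map(|i| i % 1000).collect();

    let start = Instant::now();
    let result1 = sum_of_squares(&data);
    let elapsed1 = start.elapsed();

    let start = Instant::now();
    let result2 = sum_of_squares_manual(&data);
    let elapsed2 = start.elapsed();

    println!("Iterator: {} in {:?}", result1, elapsed1);
    println!("Manual:   {} in {:?}", result2, elapsed2);
}

Illustrative output (timings vary by machine and run):

Iterator: 3328335000000 in 7.2ms
Manual:   3328335000000 in 7.1ms

The times are within noise. For this loop there is no measurable abstraction penalty.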
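One part of the claim above can be checked without a timer. The sketch below uses only the standard library and assumes nothing beyond the sum_of_squares function already shown. It checks that a filter/map chain is a plain value no larger than the underlying slice iterator (non-capturing closures occupy zero bytes) and that adapters do no work until the chain is consumed. The helper names are illustrative and not part of any library.

```rust
use std::cell::Cell;
use std::mem::{size_of, size_of_val};

fn sum_of_squares(data: &[i64]) -> i64 {
    data.iter().filter(|&&x| x > 0).map(|&x| x * x).sum()
}

/// Illustrative helper: true if the filter+map chain has the same size
/// as a bare slice iterator, i.e. the adapters add no storage.
fn chain_is_same_size_as_slice_iter(data: &[i64]) -> bool {
    let chain = data.iter().filter(|&&x| x > 0).map(|&x| x * x);
    // The lifetime parameter does not affect the size of the type.
    size_of_val(&chain) == size_of::<std::slice::Iter<'static, i64>>()
}

/// Illustrative helper: how often the map closure has run
/// before and after the chain is consumed.
fn map_calls_before_and_after(data: &[i64]) -> (usize, usize) {
    let calls = Cell::new(0usize);
    let chain = data.iter().map(|&x| {
        calls.set(calls.get() + 1);
        x * x
    });
    let before = calls.get(); // adapters are lazy: nothing has run yet
    let _total: i64 = chain.sum(); // consuming the chain drives the loop
    (before, calls.get())
}

fn main() {
    let data = [1, -2, 3];
    println!("sum of squares: {}", sum_of_squares(&data));
    println!("same size as slice iterator: {}", chain_is_same_size_as_slice_iter(&data));
    println!("map calls (before, after): {:?}", map_calls_before_and_after(&data));
}
```

This says nothing about timing on its own, but it shows why the optimizer has so little to remove: the chain is just a pair of pointers plus code.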

Generics Compile to Specialized Code

Generics in Rust use monomorphization: the compiler generates a separate copy of the function for each concrete type it is called with.

fn max_value<T: PartialOrd>(a: T, b: T) -> T {
    if a > b { a } else { b }
}

fn main() {
    let x = max_value(10i32, 20i32);   // uses the instantiation max_value::<i32>
    let y = max_value(1.5f64, 2.5f64); // uses the instantiation max_value::<f64>
    println!("{} {}", x, y);
}

Each concrete type gets its own version of max_value, shared by every call site that uses that type. There is no vtable lookup, no boxing, no runtime type dispatch. The generated code is equivalent to what you would get from separate hand-written max_i32 and max_f64 functions.

The tradeoff: larger binaries. Each specialization is a separate chunk of machine code. For most applications, this is irrelevant. For embedded systems with tight flash constraints, it can matter.
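To make the comparison with hand-written functions concrete, here is a minimal sketch. The max_i32 and max_f64 functions are the hypothetical hand-specialized versions mentioned above, and instantiated_for is a helper introduced only for illustration. Coercing max_value::<i32> to a plain function pointer shows that each instantiation is an ordinary concrete function.

```rust
use std::any::type_name;

fn max_value<T: PartialOrd>(a: T, b: T) -> T {
    if a > b { a } else { b }
}

// Hypothetical hand-written equivalents of the two instantiations.
fn max_i32(a: i32, b: i32) -> i32 {
    if a > b { a } else { b }
}

fn max_f64(a: f64, b: f64) -> f64 {
    if a > b { a } else { b }
}

/// Illustrative helper: names the concrete type a generic call was instantiated for.
/// The exact string is not guaranteed to be stable across compiler versions.
fn instantiated_for<T>(_value: &T) -> &'static str {
    type_name::<T>()
}

fn main() {
    // Each instantiation can be used as a plain, non-generic function pointer.
    let as_i32: fn(i32, i32) -> i32 = max_value::<i32>;
    let as_f64: fn(f64, f64) -> f64 = max_value::<f64>;

    assert_eq!(as_i32(10, 20), max_i32(10, 20));
    assert_eq!(as_f64(1.5, 2.5), max_f64(1.5, 2.5));

    println!(
        "instantiated for {} and {}",
        instantiated_for(&10i32),
        instantiated_for(&1.5f64)
    );
}
```

If binary size matters more than call overhead, a trait-object version of the function is one way to trade a single shared copy for dynamic dispatch.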

Async Compiles to State Machines

Rust's async/await can look as if it needs threads or a heap-allocated future per call. It needs neither. Each async fn compiles to a state machine: conceptually an enum in which each variant corresponds to a suspension point and holds the local variables that are live across it.

async fn fetch_and_process(url: &str) -> Result<String, reqwest::Error> {
    let response = reqwest::get(url).await?;
    let body = response.text().await?;
    Ok(body.to_uppercase())
}

The compiler transforms this into roughly:

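// Conceptual sketch: the real type is anonymous and compiler-generated.
// GetFuture and TextFuture are placeholder names for the futures being awaited.
enum FetchAndProcess<'a> {
    Start { url: &'a str },
    WaitingForResponse { future: GetFuture },
    WaitingForBody { future: TextFuture },
    Done,
}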

Each .await becomes a state transition. No threads are created. The future itself needs no heap allocation: its size is known at compile time, so it can live on the stack or be embedded in a parent future. Executors commonly make one allocation when a future is spawned as a top-level task, but that cost is per task, not per .await. The runtime polls the state machine, advancing it when I/O completes.

Compare this to Go's goroutines, which allocate a stack per goroutine, or Python's asyncio, which allocates future objects on the heap. Rust's design aims to add no overhead beyond what a hand-written state machine would require.
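The polling model can be seen without any async runtime. The sketch below is a minimal illustration using only the standard library; YieldOnce, add_after_two_yields, NoopWaker, and poll_to_completion are all hypothetical names invented for this example. It polls an async fn by hand and counts how many polls it takes to finish. Note that the Arc is allocated for the demo waker, not for the future, which stays pinned on the stack.

```rust
use std::future::Future;
use std::pin::{pin, Pin};
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};

/// Hypothetical leaf future: Pending on the first poll, Ready on the second.
struct YieldOnce {
    yielded: bool,
}

impl Future for YieldOnce {
    type Output = ();

    fn poll(mut self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<()> {
        if self.yielded {
            Poll::Ready(())
        } else {
            self.yielded = true;
            cx.waker().wake_by_ref(); // ask to be polled again
            Poll::Pending
        }
    }
}

/// Two suspension points, so the generated state machine is polled three times.
async fn add_after_two_yields(a: u32, b: u32) -> u32 {
    YieldOnce { yielded: false }.await;
    let sum = a + b;
    YieldOnce { yielded: false }.await;
    sum
}

/// A waker that does nothing; enough for a busy-polling demo loop.
struct NoopWaker;

impl Wake for NoopWaker {
    fn wake(self: Arc<Self>) {}
}

/// Toy executor: polls a future until it completes and counts the polls.
fn poll_to_completion<F: Future>(fut: F) -> (F::Output, usize) {
    let waker = Waker::from(Arc::new(NoopWaker));
    let mut cx = Context::from_waker(&waker);
    let mut fut = pin!(fut); // pinned on the stack, no Box needed
    let mut polls = 0;
    loop {
        polls += 1;
        if let Poll::Ready(out) = fut.as_mut().poll(&mut cx) {
            return (out, polls);
        }
    }
}

fn main() {
    let fut = add_after_two_yields(2, 3);
    println!("future occupies {} bytes", std::mem::size_of_val(&fut));
    let (out, polls) = poll_to_completion(fut);
    println!("result {} after {} polls", out, polls);
}
```

A real executor would park until the waker fires instead of looping, but the contract with the future is the same: call poll, get Pending or Ready.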

Where the Abstraction Has Cost

Zero-cost does not mean every Rust abstraction is free. Some features carry inherent costs.

Dynamic Dispatch with `dyn Trait`

When you use trait objects, the compiler cannot monomorphize. It inserts a vtable lookup:

fn print_all(items: &[Box<dyn std::fmt::Display>]) {
    for item in items {
        println!("{}", item); // vtable lookup per call
    }
}

Each call to Display::fmt goes through an indirection. For hot loops, this can matter. The fix is generics where possible:

fn print_all<T: std::fmt::Display>(items: &[T]) {
    for item in items {
        println!("{}", item); // Display impl chosen at compile time
    }
}

Use dyn Trait when you need heterogeneous collections. Use generics when all elements are the same type.
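Printing is dominated by formatting and I/O, so the difference is easier to see with a small computation. The following sketch assumes a hypothetical Shape trait with two implementors, none of which appear in the original text, and contrasts a generic function over a homogeneous slice with a trait-object function over a mixed collection.

```rust
use std::mem::size_of;

// Hypothetical trait and types, introduced only for this illustration.
trait Shape {
    fn area(&self) -> f64;
}

struct Square(f64);
struct Rect(f64, f64);

impl Shape for Square {
    fn area(&self) -> f64 {
        self.0 * self.0
    }
}

impl Shape for Rect {
    fn area(&self) -> f64 {
        self.0 * self.1
    }
}

/// Static dispatch: one copy is generated per element type T,
/// and the call to area() is resolved at compile time.
fn total_area_static<T: Shape>(shapes: &[T]) -> f64 {
    shapes.iter().map(|s| s.area()).sum()
}

/// Dynamic dispatch: a single copy, with each area() call going through a vtable.
fn total_area_dyn(shapes: &[Box<dyn Shape>]) -> f64 {
    shapes.iter().map(|s| s.area()).sum()
}

fn main() {
    let squares = [Square(1.0), Square(2.0)];
    let mixed: Vec<Box<dyn Shape>> = vec![Box::new(Square(2.0)), Box::new(Rect(2.0, 3.0))];

    println!("static total: {}", total_area_static(&squares));
    println!("dyn total:    {}", total_area_dyn(&mixed));

    // A trait-object reference carries a data pointer plus a vtable pointer.
    assert_eq!(size_of::<&dyn Shape>(), 2 * size_of::<&Square>());
}
```

The generic version cannot accept the mixed vector at all, which is the practical reason to reach for dyn Trait.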

Arc and Atomic Reference Counting

Arc<T> is not free. Every clone and drop performs an atomic increment or decrement:

use std::sync::Arc;

let data = Arc::new(vec![1, 2, 3]);
let data2 = Arc::clone(&data); // atomic increment
drop(data2);                    // atomic decrement

An uncontended atomic operation is cheaper than taking a mutex but more expensive than doing nothing, and when many threads clone the same Arc the counter's cache line bounces between cores. In hot paths where you clone Arc millions of times per second, it shows up in profiles. The alternative is to pass references where possible and restructure to avoid repeated cloning.
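As a small sketch of that restructuring, the two helpers below compute the same total. One clones the Arc on every iteration, the other borrows the data it points to. All function names here are hypothetical.

```rust
use std::sync::Arc;

/// Takes ownership of an Arc: the caller must clone, and the drop
/// at the end of the call decrements the count again.
fn sum_owned(data: Arc<Vec<i32>>) -> i32 {
    data.iter().sum()
}

/// Borrows the underlying slice: no reference-count traffic at all.
fn sum_borrowed(data: &[i32]) -> i32 {
    data.iter().sum()
}

/// One atomic increment and one atomic decrement per round.
fn total_with_clones(data: &Arc<Vec<i32>>, rounds: usize) -> i32 {
    (0..rounds).map(|_| sum_owned(Arc::clone(data))).sum()
}

/// Same result, but the Arc is only dereferenced, never cloned.
fn total_with_borrows(data: &Arc<Vec<i32>>, rounds: usize) -> i32 {
    (0..rounds).map(|_| sum_borrowed(data)).sum()
}

fn main() {
    let data = Arc::new(vec![1, 2, 3]);
    println!("with clones:  {}", total_with_clones(&data, 4));
    println!("with borrows: {}", total_with_borrows(&data, 4));
    println!("strong count afterwards: {}", Arc::strong_count(&data));
}
```

Cloning is still the right tool when a value must outlive the current scope, for example when handing it to another thread. The point is only to avoid it inside loops that merely read.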

Heap Allocation

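Box, Vec, String, and HashMap store their contents on the heap (an empty collection created with new() does not allocate until something is inserted). Allocation is not free: it goes through the global allocator, by default the system allocator, which involves bookkeeping and potential contention.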

// This allocates
let s = String::from("hello");

// This does not
let s: &str = "hello";

When a function only needs to read a string, accept &str instead of String. A caller holding a literal or a borrowed string can then pass it directly instead of allocating a new String just to make the call.
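The sketch below illustrates both points under simple assumptions. count_words is a hypothetical read-only function that accepts &str, and fill_without_reallocating shows that reserving capacity up front keeps a Vec's buffer in place while it is filled.

```rust
/// Hypothetical read-only function: borrowing lets callers pass
/// literals and owned Strings alike without a new allocation.
fn count_words(text: &str) -> usize {
    text.split_whitespace().count()
}

/// Reserves space once, then pushes n items. Returns true if the
/// buffer never moved, i.e. no reallocation happened while filling.
fn fill_without_reallocating(n: usize) -> bool {
    let mut v: Vec<u64> = Vec::with_capacity(n);
    let before = v.as_ptr();
    for i in 0..n as u64 {
        v.push(i);
    }
    before == v.as_ptr() && v.len() == n
}

fn main() {
    let owned = String::from("zero cost abstractions");

    println!("{}", count_words("zero cost")); // string literal, no allocation
    println!("{}", count_words(&owned)); // &String coerces to &str

    // Empty collections have not allocated anything yet.
    assert_eq!(Vec::<u64>::new().capacity(), 0);
    assert_eq!(String::new().capacity(), 0);

    println!("filled in place: {}", fill_without_reallocating(1000));
}
```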

Benchmarking to Prove It

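Use the criterion crate to compare abstraction cost. Add it under [dev-dependencies] in Cargo.toml, declare the bench target with harness = false, and run cargo bench: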

// benches/iterators.rs
use criterion::{black_box, criterion_group, criterion_main, Criterion};

fn sum_iterator(data: &[i64]) -> i64 {
    data.iter().filter(|&&x| x > 0).map(|&x| x * x).sum()
}

fn sum_loop(data: &[i64]) -> i64 {
    let mut total = 0;
    for &x in data {
        if x > 0 {
            total += x * x;
        }
    }
    total
}

fn benchmark(c: &mut Criterion) {
    let data: Vec<i64> = (-5000..5000).collect();

    c.bench_function("iterator", |b| {
        b.iter(|| sum_iterator(black_box(&data)))
    });

    c.bench_function("loop", |b| {
        b.iter(|| sum_loop(black_box(&data)))
    });
}

criterion_group!(benches, benchmark);
criterion_main!(benches);

iterator                time:   [1.234 us 1.241 us 1.248 us]
loop                    time:   [1.231 us 1.239 us 1.247 us]

Within measurement noise. For this workload the abstraction is genuinely zero-cost.

Common Pitfalls

Assuming dyn Trait is free. It is not. Virtual dispatch adds indirection. Use generics for performance-critical paths and dyn for flexibility.
Cloning Arc in tight loops. Each clone is an atomic operation. Restructure to borrow where possible.
Premature optimization. Before worrying about abstraction costs, measure. Most applications are bottlenecked by I/O, not iterator chains or generic dispatch.
Not compiling in release mode. Debug builds disable optimizations. Iterator chains are significantly slower in debug because nothing is inlined. Always benchmark with --release.
Confusing zero-cost with no-cost. Zero-cost means the abstraction has no overhead compared to the equivalent hand-written code. It does not mean the operation itself is free. Iterating over a million elements still takes time.

Key Takeaways

Iterators, generics, and async/await are zero-cost in the sense that, with optimizations on, the compiler typically produces the same machine code as hand-written equivalents.
Monomorphization specializes generic functions for each concrete type, eliminating runtime dispatch.
Async functions compile to state machines. They need neither a thread nor a heap allocation per .await.
Dynamic dispatch (dyn Trait), Arc, and heap allocation have real costs. Use them deliberately.
Always benchmark in release mode. Debug builds disable the optimizations that make abstractions zero-cost.
Measure before you optimize. The abstraction overhead is almost never the bottleneck.