Profiling

Profiling tells you where your program spends its time. Without profiling, optimization is guesswork — you will optimize the function you think is slow instead of the function that actually consumes 80% of runtime. Profiling replaces intuition with measurement.

The Profiling Workflow

The correct order is always:

Write correct code — make it work first
Measure — profile to find the actual bottleneck
Optimize the bottleneck — improve only the hot path
Measure again — confirm the optimization helped

Optimizing before measuring wastes time. Optimizing without measuring afterward means you do not know if you helped or hurt.

gprof: Function-Level Profiling

gprof is the classic profiling tool included with GCC. It measures how much time each function consumes and how many times each function is called.

Using gprof

gcc -pg -g program.c -o program
./program                  # Generates gmon.out
gprof ./program gmon.out   # Generate the report

The -pg flag adds instrumentation code to every function. When the program runs, it writes profiling data to gmon.out. The gprof command reads this data and produces a report.

Reading gprof Output

  %   cumulative   self              self     total
 time   seconds   seconds    calls  ms/call  ms/call  name
 72.3      2.89     2.89   100000     0.03     0.03  process_record
 15.1      3.49     0.60        1   600.00   600.00  load_data
  8.2      3.82     0.33   100000     0.00     0.00  validate_input
  4.4      3.99     0.18        1   180.00  3990.00  main

This tells you that process_record takes 72.3% of execution time. That is where optimization effort should go. The function you thought was slow (load_data) is only 15.1%.

gprof Limitations

gprof uses statistical sampling — it interrupts the program periodically and records which function is executing. Short-running functions may be missed. gprof also does not profile time spent in system calls or I/O waits. For more accurate profiling, use perf or callgrind.

perf: Linux Hardware Counter Profiling

perf is the Linux kernel's profiling tool. It uses hardware performance counters built into the CPU, giving precise measurements with minimal overhead.

Basic perf Usage

gcc -g -O2 program.c -o program
perf stat ./program

 Performance counter stats for './program':

       3,987.42 msec task-clock
          2,451 context-switches
             89 cpu-migrations
         12,345 page-faults
 14,234,567,890 cycles
  9,876,543,210 instructions       # 0.69 insn per cycle
  1,234,567,890 branches
     45,678,901 branch-misses      # 3.70% of all branches

This overview shows CPU cycles, instructions per cycle, cache behavior, and branch prediction accuracy. Low instructions per cycle indicates the CPU is stalling — often on memory accesses.

Finding Hot Functions

perf record ./program
perf report

perf record samples the program's execution and records which functions are running at each sample. perf report shows a sorted list of functions by CPU time.

Overhead  Command   Shared Object      Symbol
  65.23%  program   program            [.] process_record
  12.45%  program   libc.so.6          [.] memcpy
   8.91%  program   program            [.] hash_lookup
   5.67%  program   program            [.] parse_line

Annotated Source

perf annotate process_record

This shows the source code (or assembly) of the hot function with the percentage of time spent on each line. You can see exactly which loop or which memory access is the bottleneck.

Valgrind Callgrind: Instruction-Level Profiling

Callgrind is a Valgrind tool that counts every instruction executed. It is slow (20-50x overhead) but deterministic — running it twice produces the same results.

gcc -g -O2 program.c -o program
valgrind --tool=callgrind ./program

This produces a file named callgrind.out.<pid>. View it with:

callgrind_annotate callgrind.out.12345

--------------------------------------------------------------------------------
Ir          file:function
--------------------------------------------------------------------------------
450,000,000 program.c:process_record
 89,000,000 program.c:hash_lookup
 34,000,000 program.c:parse_line
 12,000,000 program.c:main

Ir is instruction reads — the total number of instructions executed in each function. This is deterministic and not affected by system load or timing variations.

KCachegrind Visualization

For graphical analysis:

kcachegrind callgrind.out.12345

KCachegrind shows call graphs, source annotation, and the caller-callee relationship for every function. It is the most detailed profiling visualization available for C.

Flamegraphs

Flamegraphs visualize profiling data as a stack chart. Each horizontal bar represents a function, and its width represents the time spent in that function. Parent functions are below, child functions above.

perf record -g ./program
perf script | stackcollapse-perf.pl | flamegraph.pl > profile.svg

The -g flag records call graphs. The Flamegraph tools (from Brendan Gregg's repository) convert the data into an interactive SVG. Wide bars are bottlenecks. Narrow bars are negligible.

Flamegraphs are particularly effective for finding unexpected costs: a library function that takes 30% of runtime, or a logging function called millions of times that dominates the profile.

Benchmarking: Measuring Correctly

When measuring execution time in C, use clock_gettime, not clock.

#include <stdio.h>
#include <time.h>

double get_time_sec(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec * 1e-9;
}

int main(void) {
    double start = get_time_sec();

    /* Code to benchmark */
    volatile long sum = 0;
    for (long i = 0; i < 100000000L; i++) {
        sum += i;
    }

    double end = get_time_sec();
    printf("Elapsed: %.6f seconds\n", end - start);
    printf("Result: %ld\n", sum);

    return 0;
}

Elapsed: 0.234567 seconds
Result: 4999999950000000

Why clock_gettime

clock() measures CPU time, not wall time, and has low resolution on some platforms
clock_gettime(CLOCK_MONOTONIC, ...) measures wall time with nanosecond resolution and is not affected by system clock adjustments
The volatile keyword prevents the compiler from optimizing away the computation being benchmarked

Benchmarking Rules

/* Run the benchmark multiple times and report statistics */
#define ITERATIONS 10

int main(void) {
    double times[ITERATIONS];

    for (int run = 0; run < ITERATIONS; run++) {
        double start = get_time_sec();
        do_work();
        double end = get_time_sec();
        times[run] = end - start;
    }

    /* Sort and report median, not mean */
    /* (mean is skewed by outliers from OS scheduling) */
    qsort(times, ITERATIONS, sizeof(double), compare_double);
    printf("Median: %.6f seconds\n", times[ITERATIONS / 2]);

    return 0;
}

Report the median of multiple runs. The mean is skewed by occasional long runs caused by OS scheduling, garbage collection in other processes, or thermal throttling.

What to Optimize

Amdahl's Law governs optimization: if a function takes 80% of runtime and you make it twice as fast, you save 40% of total runtime. If a function takes 2% of runtime and you make it ten times faster, you save 1.8%. Always optimize the biggest bottleneck first.

Common C Performance Bottlenecks

/* Cache misses: traversing a linked list is slow */
struct Node *curr = head;
while (curr) {
    process(curr->data); /* Each node may be in a different cache line */
    curr = curr->next;
}

/* Cache-friendly: iterating an array is fast */
for (size_t i = 0; i < n; i++) {
    process(array[i]); /* Sequential memory access, hardware prefetch works */
}

Memory access patterns dominate C performance. The CPU cache is 100x faster than main memory. Code that accesses memory sequentially (arrays) is dramatically faster than code that follows pointers (linked lists, trees).

Algorithmic Improvements Beat Micro-optimizations

/* O(n^2) - no amount of micro-optimization helps */
for (size_t i = 0; i < n; i++) {
    for (size_t j = 0; j < n; j++) {
        if (items[j].key == target) { found = 1; break; }
    }
}

/* O(n log n) + O(n) - sort first, then binary search */
qsort(items, n, sizeof(Item), compare_items);
for (size_t i = 0; i < queries; i++) {
    Item *found = bsearch(&target, items, n, sizeof(Item), compare_items);
}

Replacing an O(n^2) algorithm with an O(n log n) algorithm will always outperform hand-optimized assembly for the O(n^2) version, at sufficient input sizes.

Common Pitfalls

Optimizing without profiling — "I think this function is slow" is not evidence. Profile first. The bottleneck is rarely where you expect.
Using clock() for benchmarking — clock() measures CPU time with coarse granularity. Use clock_gettime(CLOCK_MONOTONIC, ...) for precise wall-clock timing.
Benchmarking debug builds — Always profile optimized builds (-O2 or -O3). Debug builds (-O0) have entirely different performance characteristics. The bottleneck in a debug build may not exist in an optimized build.
Single-run benchmarks — One measurement is meaningless. System load, caching effects, and thermal throttling cause variation. Run multiple iterations and report the median.
Micro-optimizing cold code — If a function executes once during startup, making it faster has no measurable impact. Focus on code in hot loops that execute millions of times.
Ignoring cache behavior — Modern CPUs spend most of their time waiting for memory. A data structure that fits in cache can be 100x faster than one that does not, regardless of algorithmic complexity.
Letting the compiler optimize away the benchmark — If the result of a computation is never used, the compiler removes it entirely. Use volatile or print the result to prevent dead code elimination.

Key Takeaways

Profile before optimizing. Measure, change, measure again. Never guess where the bottleneck is.
gprof provides function-level profiling with call counts. It is the simplest tool to start with but has limitations with I/O-bound code.
perf uses hardware counters for precise, low-overhead profiling on Linux. perf record and perf report show exactly where CPU time is spent.
Valgrind's callgrind counts every instruction, giving deterministic results. It is slow but thorough.
Flamegraphs visualize profiling data and make bottlenecks immediately visible as wide horizontal bars.
Use clock_gettime(CLOCK_MONOTONIC, ...) for benchmarking. Run multiple iterations and report the median.
Memory access patterns dominate C performance. Sequential array access is orders of magnitude faster than random pointer chasing due to CPU caching.
Algorithmic improvements outperform micro-optimizations. An O(n log n) algorithm beats a hand-optimized O(n^2) algorithm at sufficient scale.