Profiling
Profiling tells you where your program spends its time. Without profiling, optimization is guesswork — you will optimize the function you think is slow instead of the function that actually consumes 80% of runtime. Profiling replaces intuition with measurement.
The Profiling Workflow
The correct order is always:
- Write correct code — make it work first
- Measure — profile to find the actual bottleneck
- Optimize the bottleneck — improve only the hot path
- Measure again — confirm the optimization helped
Optimizing before measuring wastes time. Optimizing without measuring afterward means you do not know if you helped or hurt.
gprof: Function-Level Profiling
gprof is the classic profiling tool included with GCC. It measures how much time each function consumes and how many times each function is called.
Using gprof
gcc -pg -g program.c -o program
./program # Generates gmon.out
gprof ./program gmon.out # Generate the report
The -pg flag adds instrumentation code to every function. When the program runs, it writes profiling data to gmon.out. The gprof command reads this data and produces a report.
Reading gprof Output
% cumulative self self total
time seconds seconds calls ms/call ms/call name
72.3 2.89 2.89 100000 0.03 0.03 process_record
15.1 3.49 0.60 1 600.00 600.00 load_data
8.2 3.82 0.33 100000 0.00 0.00 validate_input
4.4 3.99 0.18 1 180.00 3990.00 main
This tells you that process_record takes 72.3% of execution time. That is where optimization effort should go. The function you thought was slow (load_data) is only 15.1%.
gprof Limitations
gprof uses statistical sampling — it interrupts the program periodically and records which function is executing. Short-running functions may be missed. gprof also does not profile time spent in system calls or I/O waits. For more accurate profiling, use perf or callgrind.
perf: Linux Hardware Counter Profiling
perf is the Linux kernel's profiling tool. It uses hardware performance counters built into the CPU, giving precise measurements with minimal overhead.
Basic perf Usage
gcc -g -O2 program.c -o program
perf stat ./program
Performance counter stats for './program':
3,987.42 msec task-clock
2,451 context-switches
89 cpu-migrations
12,345 page-faults
14,234,567,890 cycles
9,876,543,210 instructions # 0.69 insn per cycle
1,234,567,890 branches
45,678,901 branch-misses # 3.70% of all branches
This overview shows CPU cycles, instructions per cycle, cache behavior, and branch prediction accuracy. Low instructions per cycle indicates the CPU is stalling — often on memory accesses.
Finding Hot Functions
perf record ./program
perf report
perf record samples the program's execution and records which functions are running at each sample. perf report shows a sorted list of functions by CPU time.
Overhead Command Shared Object Symbol
65.23% program program [.] process_record
12.45% program libc.so.6 [.] memcpy
8.91% program program [.] hash_lookup
5.67% program program [.] parse_line
Annotated Source
perf annotate process_record
This shows the source code (or assembly) of the hot function with the percentage of time spent on each line. You can see exactly which loop or which memory access is the bottleneck.
Valgrind Callgrind: Instruction-Level Profiling
Callgrind is a Valgrind tool that counts every instruction executed. It is slow (20-50x overhead) but deterministic — running it twice produces the same results.
gcc -g -O2 program.c -o program
valgrind --tool=callgrind ./program
This produces a file named callgrind.out.<pid>. View it with:
callgrind_annotate callgrind.out.12345
--------------------------------------------------------------------------------
Ir file:function
--------------------------------------------------------------------------------
450,000,000 program.c:process_record
89,000,000 program.c:hash_lookup
34,000,000 program.c:parse_line
12,000,000 program.c:main
Ir is instruction reads — the total number of instructions executed in each function. This is deterministic and not affected by system load or timing variations.
KCachegrind Visualization
For graphical analysis:
kcachegrind callgrind.out.12345
KCachegrind shows call graphs, source annotation, and the caller-callee relationship for every function. It is the most detailed profiling visualization available for C.
Flamegraphs
Flamegraphs visualize profiling data as a stack chart. Each horizontal bar represents a function, and its width represents the time spent in that function. Parent functions are below, child functions above.
perf record -g ./program
perf script | stackcollapse-perf.pl | flamegraph.pl > profile.svg
The -g flag records call graphs. The Flamegraph tools (from Brendan Gregg's repository) convert the data into an interactive SVG. Wide bars are bottlenecks. Narrow bars are negligible.
Flamegraphs are particularly effective for finding unexpected costs: a library function that takes 30% of runtime, or a logging function called millions of times that dominates the profile.
Benchmarking: Measuring Correctly
When measuring execution time in C, use clock_gettime, not clock.
#include <stdio.h>
#include <time.h>
double get_time_sec(void) {
struct timespec ts;
clock_gettime(CLOCK_MONOTONIC, &ts);
return ts.tv_sec + ts.tv_nsec * 1e-9;
}
int main(void) {
double start = get_time_sec();
/* Code to benchmark */
volatile long sum = 0;
for (long i = 0; i < 100000000L; i++) {
sum += i;
}
double end = get_time_sec();
printf("Elapsed: %.6f seconds\n", end - start);
printf("Result: %ld\n", sum);
return 0;
}
Elapsed: 0.234567 seconds
Result: 4999999950000000
Why clock_gettime
clock()measures CPU time, not wall time, and has low resolution on some platformsclock_gettime(CLOCK_MONOTONIC, ...)measures wall time with nanosecond resolution and is not affected by system clock adjustments- The
volatilekeyword prevents the compiler from optimizing away the computation being benchmarked
Benchmarking Rules
/* Run the benchmark multiple times and report statistics */
#define ITERATIONS 10
int main(void) {
double times[ITERATIONS];
for (int run = 0; run < ITERATIONS; run++) {
double start = get_time_sec();
do_work();
double end = get_time_sec();
times[run] = end - start;
}
/* Sort and report median, not mean */
/* (mean is skewed by outliers from OS scheduling) */
qsort(times, ITERATIONS, sizeof(double), compare_double);
printf("Median: %.6f seconds\n", times[ITERATIONS / 2]);
return 0;
}
Report the median of multiple runs. The mean is skewed by occasional long runs caused by OS scheduling, garbage collection in other processes, or thermal throttling.
What to Optimize
Amdahl's Law governs optimization: if a function takes 80% of runtime and you make it twice as fast, you save 40% of total runtime. If a function takes 2% of runtime and you make it ten times faster, you save 1.8%. Always optimize the biggest bottleneck first.
Common C Performance Bottlenecks
/* Cache misses: traversing a linked list is slow */
struct Node *curr = head;
while (curr) {
process(curr->data); /* Each node may be in a different cache line */
curr = curr->next;
}
/* Cache-friendly: iterating an array is fast */
for (size_t i = 0; i < n; i++) {
process(array[i]); /* Sequential memory access, hardware prefetch works */
}
Memory access patterns dominate C performance. The CPU cache is 100x faster than main memory. Code that accesses memory sequentially (arrays) is dramatically faster than code that follows pointers (linked lists, trees).
Algorithmic Improvements Beat Micro-optimizations
/* O(n^2) - no amount of micro-optimization helps */
for (size_t i = 0; i < n; i++) {
for (size_t j = 0; j < n; j++) {
if (items[j].key == target) { found = 1; break; }
}
}
/* O(n log n) + O(n) - sort first, then binary search */
qsort(items, n, sizeof(Item), compare_items);
for (size_t i = 0; i < queries; i++) {
Item *found = bsearch(&target, items, n, sizeof(Item), compare_items);
}
Replacing an O(n^2) algorithm with an O(n log n) algorithm will always outperform hand-optimized assembly for the O(n^2) version, at sufficient input sizes.
Common Pitfalls
- Optimizing without profiling — "I think this function is slow" is not evidence. Profile first. The bottleneck is rarely where you expect.
- Using
clock()for benchmarking —clock()measures CPU time with coarse granularity. Useclock_gettime(CLOCK_MONOTONIC, ...)for precise wall-clock timing. - Benchmarking debug builds — Always profile optimized builds (
-O2or-O3). Debug builds (-O0) have entirely different performance characteristics. The bottleneck in a debug build may not exist in an optimized build. - Single-run benchmarks — One measurement is meaningless. System load, caching effects, and thermal throttling cause variation. Run multiple iterations and report the median.
- Micro-optimizing cold code — If a function executes once during startup, making it faster has no measurable impact. Focus on code in hot loops that execute millions of times.
- Ignoring cache behavior — Modern CPUs spend most of their time waiting for memory. A data structure that fits in cache can be 100x faster than one that does not, regardless of algorithmic complexity.
- Letting the compiler optimize away the benchmark — If the result of a computation is never used, the compiler removes it entirely. Use
volatileor print the result to prevent dead code elimination.
Key Takeaways
- Profile before optimizing. Measure, change, measure again. Never guess where the bottleneck is.
- gprof provides function-level profiling with call counts. It is the simplest tool to start with but has limitations with I/O-bound code.
perfuses hardware counters for precise, low-overhead profiling on Linux.perf recordandperf reportshow exactly where CPU time is spent.- Valgrind's callgrind counts every instruction, giving deterministic results. It is slow but thorough.
- Flamegraphs visualize profiling data and make bottlenecks immediately visible as wide horizontal bars.
- Use
clock_gettime(CLOCK_MONOTONIC, ...)for benchmarking. Run multiple iterations and report the median. - Memory access patterns dominate C performance. Sequential array access is orders of magnitude faster than random pointer chasing due to CPU caching.
- Algorithmic improvements outperform micro-optimizations. An O(n log n) algorithm beats a hand-optimized O(n^2) algorithm at sufficient scale.