Abstract Interpretation Applied
Theoretical Foundation
Abstract interpretation, formalized by Cousot and Cousot (1977), provides a systematic framework for constructing sound static analyses by computing over abstract domains that approximate the concrete semantics of programs. Every analysis built in this framework is sound by construction — the abstraction provably overapproximates all possible concrete executions.
Concrete Semantics
The concrete (collecting) semantics of a program maps each program point to the set of all states reachable at that point across all possible executions. For a program with variables x₁...xₙ over integers, a concrete state is a tuple in ℤⁿ, and the collecting semantics assigns to each program point a set S ⊆ ℤⁿ. This is precise but uncomputable — Rice's theorem forbids computing arbitrary reachable state sets.
Abstract Semantics
An abstract domain D♯ approximates sets of concrete states. The connection between concrete and abstract is formalized by a Galois connection:
(℘(Concrete), ⊆) ⟵α/γ⟶ (Abstract, ⊑)
- Abstraction function α: maps concrete sets to their best abstract representation
- Concretization function γ: maps abstract elements to the concrete sets they represent
- Galois connection requirement: α(c) ⊑ a ⟺ c ⊆ γ(a) for all c ∈ ℘(Concrete), a ∈ Abstract
Soundness condition: For every concrete transfer function f and its abstract counterpart f♯, we require α ∘ f ∘ γ ⊑ f♯ — the abstract transfer function overapproximates the concrete one.
Fixed-Point Computation
The abstract semantics is computed as the least fixed point of abstract transfer functions over the abstract domain. Convergence is guaranteed when the abstract domain has finite height (ascending chain condition). For infinite-height domains, widening operators force convergence by extrapolating trends, at the cost of precision. Narrowing operators can then recover some lost precision by iterating downward from the widened result.
Numerical Abstract Domains
Intervals (Box Domain)
The simplest numerical domain. Each variable xᵢ is bounded independently: xᵢ ∈ [aᵢ, bᵢ]. Represents axis-aligned boxes in ℤⁿ or ℝⁿ.
- Operations: Join = componentwise hull [min(a₁,a₂), max(b₁,b₂)]; meet = componentwise intersection; assignment transfer by interval arithmetic.
- Cost: O(n) per operation. Highly scalable.
- Limitation: Cannot express relations between variables. If x ∈ [0,100] and y = x, the interval domain cannot express y = x; it only knows y ∈ [0,100]. The constraint x ≤ y is unrepresentable.
Widening: If a bound is increasing across iterations, widen to ±∞. Standard widening with thresholds uses a predefined set of constants (from program comparisons) as intermediate widening targets, preventing premature jump to infinity.
Octagons
The octagon domain (Miné, 2006) tracks constraints of the form ±xᵢ ± xⱼ ≤ c. This captures relationships between pairs of variables: differences (x - y ≤ 5), sums (x + y ≤ 10), and individual bounds (x ≤ 3 = x + 0·y ≤ 3).
- Representation: Difference bound matrix (DBM) of size 2n × 2n. Each variable x has two forms: x⁺ and x⁻ = -x.
- Closure: Floyd-Warshall shortest-path algorithm computes the tightest implied constraints. O(n³) per closure.
- Cost: O(n³) time, O(n²) space per operation. Practical for hundreds of variables.
- Precision: Strictly more precise than intervals for relational properties between variable pairs. Cannot express x + 2y ≤ 5 (non-unit coefficients).
Polyhedra
The convex polyhedra domain (Cousot and Halbwachs, 1978) tracks arbitrary linear inequalities: Σ aᵢxᵢ ≤ c. The most precise widely-used numerical domain.
- Dual representation: Constraint form (system of inequalities) and generator form (vertices + rays). Conversion between them (Chernikova's algorithm) is exponential in the worst case.
- Operations: Join = convex hull (expensive), meet = conjunction of constraints (cheap), widening = drop non-stable constraints.
- Cost: Exponential in the number of variables in the worst case. Practical for tens of variables.
- Applications: Verifying array bounds, loop invariants involving multiple variables.
Zones (DBMs)
A restricted form of octagons tracking only xᵢ - xⱼ ≤ c (difference constraints, no sums). Equivalent to shortest-path problems. O(n³) closure via Bellman-Ford or Floyd-Warshall. Used extensively in timed automata verification.
Domain Hierarchy
Intervals ⊂ Zones ⊂ Octagons ⊂ Polyhedra (in terms of expressible constraints). Each step up gains precision at increased computational cost.
Non-Numerical Domains
Nullness Analysis
Tracks whether references may be null, must be non-null, or are unknown. Domain: {Null, NonNull, MaybeNull, ⊥}. Simple but high-impact — null pointer dereferences are among the most common bugs. Integrated into tools like Infer and NullAway.
Taint Analysis
Tracks whether values are influenced by untrusted input (tainted) or known-safe (untainted). Sources (user input, network), sinks (SQL queries, system calls), and sanitizers (input validation) define the security policy. Abstract domain: {Tainted, Untainted, ⊤}. Detects injection vulnerabilities (SQL injection, XSS, command injection).
String Domains
Abstract domains for string values: prefix/suffix analysis, regular language abstractions, string length intervals. Applied to detect format string vulnerabilities, validate URL patterns, verify SQL query structure.
Combining Domains
Reduced Product
Given domains D₁ and D₂, the direct product D₁ × D₂ tracks both abstractions simultaneously but independently. The reduced product adds a reduction operator that uses information from one domain to refine the other.
Example: Intervals × Congruences. Interval says x ∈ [0, 10]. Congruence says x ≡ 1 (mod 4). Reduction refines to x ∈ {1, 5, 9}. Neither domain alone captures this.
Reduction is not always computable or efficient. Practical approaches use heuristic reduction at specific program points.
Cofibered Domains
More sophisticated combination operators handle heterogeneous abstractions. The attribute grammar approach composes analyses modularly, each contributing its own domain and transfer functions, with inter-domain communication via shared attributes.
Widening and Narrowing
Widening
Widening ∇ : D × D → D forces convergence of ascending chains. Properties:
- Covering: x ⊔ y ⊑ x ∇ y (overapproximates the join)
- Convergence: For any ascending chain x₀ ⊑ x₁ ⊑ ..., the widened sequence y₀ = x₀, yᵢ₊₁ = yᵢ ∇ xᵢ₊₁ stabilizes in finite steps
Standard widening for intervals: If the upper bound increases, widen to +∞. Dual for lower bound.
Widening with thresholds: Collect comparison constants from the program (e.g., x < 100). Instead of jumping to ∞, widen to the next threshold. Dramatically improves precision for loop bounds.
Delayed widening: Apply standard join for the first k iterations, then switch to widening. Often captures the correct invariant before widening is needed.
Narrowing
After widening yields a post-fixed point (sound but imprecise), narrowing Δ refines it by iterating downward. Properties: x ⊓ y ⊑ x Δ y ⊑ x. A few narrowing iterations often recover precision lost by widening. Not guaranteed to reach the least fixed point, but improvements are always sound.
Loop Unrolling and Trace Partitioning
Trace partitioning (Rival and Mauborgne, 2007): Partition abstract states based on control flow history (which branch was taken, loop iteration count). Maintains separate abstract states per partition, merged only when partitioning budget is exhausted. Prevents join-induced precision loss at loop heads.
Industrial Tools
Astrée
Developed at ENS Paris (Cousot et al.), Astrée targets absence of runtime errors in embedded C code. Used to verify Airbus fly-by-wire software (millions of lines, zero false alarms achieved).
Key features: Specialized numerical domains for floating-point (IEEE 754), relational domains for arrays (array expansion, smashing), trace partitioning, iterator-based domain for loops, filter domain for digital signal processing code. Sound for the full C language (excluding dynamic allocation and recursion in the target domain).
Infer (Meta/Facebook)
Compositional analyzer based on separation logic and bi-abduction. Analyzes each function independently, computing pre/post-condition summaries in separation logic. Summaries compose at call sites.
Analyses: Null dereference (Pulse), memory leaks, thread safety (RacerD), performance (Starvation). Runs incrementally on diffs — analyzes only changed files and their dependents. Deployed on Facebook's mobile codebase (tens of millions of lines), catching thousands of bugs before they reach production.
Pulse: Successor to the original biabduction analysis. Maintains a disjunctive abstract state (set of symbolic heaps) with path conditions. Can detect use-after-free, double-free, and null dereferences with low false positive rates.
Sparrow
Academic analyzer for C programs from Seoul National University. Notable for its parametric analysis framework — sensitivity configurations (flow, context, field, path) can be tuned per-analysis. Demonstrated that selective context sensitivity guided by machine learning outperforms uniform policies.
Trace Abstraction and Refinement
CEGAR (Counterexample-Guided Abstraction Refinement)
Start with a coarse abstraction. If a spurious counterexample is found, refine the abstraction to eliminate it. Iterate until the property is proved or a real counterexample is found.
In abstract interpretation terms: CEGAR refines the abstract domain by adding predicates (in predicate abstraction) or constraints that distinguish the spurious trace from feasible ones. Trace abstraction refinement (Heizmann et al., 2009) generalizes this by computing Floyd-Hoare automata that overapproximate infeasible error traces.
Predicate Abstraction
The abstract domain is defined by a set of predicates over program variables. Abstract states are Boolean combinations of these predicates. Cartesian abstraction tracks each predicate independently; Boolean abstraction tracks arbitrary Boolean combinations (exponentially more precise but more expensive). Predicates are refined via interpolation from infeasible counterexample paths.
Floating-Point Analysis
Floating-point arithmetic violates real-number laws (non-associative, non-distributive, rounding). Sound analysis must account for rounding errors. Approaches:
- Interval arithmetic with directed rounding: Widen intervals to account for rounding.
- Affine arithmetic: Track linear error terms, capturing correlations between rounding errors.
- Fluctuat (CEA): Specialized tool that decomposes floating-point errors into contributions from each operation, guiding developers to numerically unstable code.