9 min read
On this page

Topological Data Analysis

Overview

Topological data analysis (TDA) applies algebraic topology to study the "shape" of data. The central insight: topological features (connected components, loops, voids) captured at multiple scales reveal structural patterns invisible to traditional statistical methods. TDA is coordinate-free, invariant to continuous deformations, and robust to noise.

Simplicial Complexes

A simplicial complex KK is a collection of simplices (vertices, edges, triangles, tetrahedra, ...) closed under taking faces: if σK\sigma \in K and τσ\tau \subseteq \sigma, then τK\tau \in K.

A kk-simplex σ=[v0,v1,,vk]\sigma = [v_0, v_1, \ldots, v_k] is the convex hull of k+1k+1 affinely independent points. Its faces are simplices formed by subsets of its vertices.

Vietoris-Rips Complex

Given a point cloud PP and parameter ϵ>0\epsilon > 0:

VRϵ(P)={σP:d(pi,pj)ϵ for all pi,pjσ}\text{VR}_\epsilon(P) = \{\sigma \subseteq P : d(p_i, p_j) \leq \epsilon \text{ for all } p_i, p_j \in \sigma\}

A simplex is included iff all pairwise distances are at most ϵ\epsilon. Equivalently, a subset forms a simplex iff the set of ϵ\epsilon-balls centered at its points has a common intersection (clique complex of the ϵ\epsilon-neighborhood graph).

Properties:

  • Determined entirely by pairwise distances (metric data suffices)
  • Flag complex: determined by its 1-skeleton (edge graph). This is computationally advantageous
  • Can contain simplices of dimension d\geq d even for points in Rd\mathbb{R}^d
  • Expensive to compute/store: up to 2n2^n simplices in the worst case. In practice, impose a maximum dimension

Relationship to Cech: Cechϵ(P)VRϵ(P)Cech2ϵ(P)\text{Cech}_\epsilon(P) \subseteq \text{VR}_\epsilon(P) \subseteq \text{Cech}_{\sqrt{2}\epsilon}(P) (by the nerve lemma and Jung's theorem). They yield the same persistent homology at scale 2\sqrt{2}.

Cech Complex

Cechϵ(P)={σP:pσB(p,ϵ/2)}\text{Cech}_\epsilon(P) = \{\sigma \subseteq P : \bigcap_{p \in \sigma} B(p, \epsilon/2) \neq \emptyset\}

A simplex is included iff the balls of radius ϵ/2\epsilon/2 centered at its vertices have a common intersection. By the nerve lemma, the Cech complex has the same homotopy type as the union of balls pPB(p,ϵ/2)\bigcup_{p \in P} B(p, \epsilon/2).

More geometrically faithful than Rips but computationally harder (requires checking higher-order intersections, not just pairwise distances). Miniball algorithms determine simplex inclusion.

Alpha Complex

The alpha complex αϵ(P)\alpha_\epsilon(P) is the subcomplex of the Delaunay triangulation consisting of simplices whose circumradius is at most ϵ\epsilon:

αϵ(P)={σDel(P):circumradius(σ)ϵ}\alpha_\epsilon(P) = \{\sigma \in \text{Del}(P) : \text{circumradius}(\sigma) \leq \epsilon\}

Properties:

  • Homotopy equivalent to the union of balls (same as Cech), by the nerve lemma applied to the Voronoi-restricted balls
  • Much smaller than the Rips or Cech complex: at most O(nd/2)O(n^{\lceil d/2 \rceil}) simplices (the size of the Delaunay triangulation)
  • Efficient for low-dimensional data (d3d \leq 3)
  • Embedded in ambient space (simplices are geometric, not abstract)
  • For d=2d = 2: at most O(n)O(n) simplices. For d=3d = 3: at most O(n2)O(n^2)

Weighted alpha complexes: use weighted Delaunay (regular) triangulations. Accommodate points with different "radii" (e.g., atoms with van der Waals radii in molecular biology).

Cubical Complexes

For gridded data (images, volumes), cubical complexes are more natural than simplicial ones. Elementary cubes (pixels/voxels and their faces) form the cells. Persistent homology on cubical complexes avoids triangulation entirely.

Efficient for image analysis, medical imaging (CT/MRI), and volumetric data.

Persistent Homology

Homology Groups (Brief Review)

The kk-th homology group Hk(K)H_k(K) of a simplicial complex KK captures kk-dimensional "holes":

  • H0H_0: connected components
  • H1H_1: loops (1-cycles not bounding a 2-chain)
  • H2H_2: voids (2-cycles not bounding a 3-chain)

The rank of HkH_k is the kk-th Betti number βk\beta_k: the number of independent kk-dimensional holes.

Homology is computed over a coefficient field (typically Z2\mathbb{Z}_2 for simplicity, or Z\mathbb{Z} for orientability information). With Z2\mathbb{Z}_2 coefficients, chains are formal sums of simplices mod 2, and the boundary operator k\partial_k maps kk-chains to (k1)(k-1)-chains.

Hk=kerk/imk+1H_k = \ker \partial_k / \text{im} \, \partial_{k+1}

Filtrations

A filtration is a nested sequence of simplicial complexes:

=K0K1K2Km=K\emptyset = K_0 \subseteq K_1 \subseteq K_2 \subseteq \cdots \subseteq K_m = K

Common filtrations:

  • Rips filtration: VRϵ1VRϵ2\text{VR}_{\epsilon_1} \subseteq \text{VR}_{\epsilon_2} \subseteq \cdots for increasing ϵ\epsilon
  • Alpha filtration: αϵ1αϵ2\alpha_{\epsilon_1} \subseteq \alpha_{\epsilon_2} \subseteq \cdots
  • Sublevel set filtration: for a function f:KRf: K \to \mathbb{R}, the sublevel sets f1(,t]f^{-1}(-\infty, t] form a filtration as tt increases. Used for scalar field topology

Birth, Death, and Persistence

As the filtration parameter increases, topological features appear (birth) and disappear (death):

  • A new connected component appears when a vertex is added (birth of 0-cycle)
  • Two components merge when an edge connects them (death of younger component --- elder rule)
  • A loop appears when an edge closes a cycle (birth of 1-cycle)
  • A loop is filled when a triangle (2-simplex) is added (death of 1-cycle)

The persistence of a feature is deathbirth\text{death} - \text{birth}. Long-lived features represent genuine topological structure; short-lived features are noise.

Persistence Diagrams

A persistence diagram Dgmk\text{Dgm}_k is a multiset of points (bi,di)(b_i, d_i) in the plane, where each point represents a kk-dimensional homological feature with birth bib_i and death did_i.

  • Points near the diagonal (dbd \approx b) represent noise
  • Points far from the diagonal represent significant features
  • Features that never die are plotted at d=d = \infty (essential classes)
  • The diagram lies above the diagonal (d>bd > b)

Barcodes

An equivalent representation: each feature is a horizontal interval [bi,di)[b_i, d_i) (a "bar"). The collection of bars is the barcode. Barcodes are often more visually intuitive than diagrams.

Computation (The Standard Algorithm)

Matrix reduction: represent the filtration boundary operator as a matrix and reduce it (column operations) to identify persistent pairs.

Given the boundary matrix DD (columns ordered by filtration time), reduce it by adding columns left-to-right (analogous to column echelon form over Z2\mathbb{Z}_2):

for j = 1 to m:
    while exists j' < j with low(j') = low(j):
        D[:, j] += D[:, j']  (mod 2)

where low(j)\text{low}(j) is the row index of the lowest nonzero entry in column jj. A persistence pair is (low(j),j)(\text{low}(j), j): the simplex at index low(j)\text{low}(j) creates the feature, the simplex at index jj destroys it.

Complexity: O(m3)O(m^3) worst case for mm simplices, but O(mω)O(m^\omega) via matrix multiplication speedups. In practice, much faster due to sparsity (often nearly linear).

Optimizations:

  • Clearing optimization: after identifying a pair, clear the corresponding column (set to zero). Reduces work by ~50%
  • Twist optimization: combine clearing with cohomology-based persistence (Bauer)
  • Chunk reduction: parallelize by processing independent blocks
  • Apparent and emergent pairs: identify trivially paired simplices before reduction

Stability Theorem

Bottleneck stability (Cohen-Steiner, Edelsbrunner, Harer, 2007): if f,g:KRf, g: K \to \mathbb{R} are two filtration functions:

dB(Dgm(f),Dgm(g))fgd_B(\text{Dgm}(f), \text{Dgm}(g)) \leq \|f - g\|_\infty

where dBd_B is the bottleneck distance between persistence diagrams (the infimum over all matchings of the maximum matched distance, with unmatched points paired to the diagonal).

This guarantees that small perturbations of the data produce small changes in the persistence diagram. Persistence diagrams are robust topological summaries.

Wasserstein stability: the pp-Wasserstein distance WpW_p also satisfies stability bounds, providing finer control for applications requiring soft matching.

Betti Numbers

The kk-th Betti number βk\beta_k counts the number of kk-dimensional holes:

  • β0\beta_0: number of connected components
  • β1\beta_1: number of independent loops/tunnels
  • β2\beta_2: number of enclosed voids/cavities

Euler characteristic: χ=k=0d(1)kβk=k=0d(1)knk\chi = \sum_{k=0}^d (-1)^k \beta_k = \sum_{k=0}^d (-1)^k n_k where nkn_k is the number of kk-simplices. Computable in linear time (just count simplices), but coarser than full homology.

Persistent Betti numbers: βks,t\beta_k^{s,t} counts the number of kk-features born at or before ss that are still alive at tt. The persistence diagram encodes the complete persistent Betti number function.

Mapper Algorithm

Mapper (Singh, Memoli, Carlsson, 2007): a topological network summarization tool for high-dimensional data.

Algorithm:

  1. Choose a filter function f:XRf: X \to \mathbb{R} (or Rk\mathbb{R}^k) --- a lens to view the data through (e.g., density, PCA coordinate, eccentricity)
  2. Cover the range of ff with overlapping intervals U1,U2,U_1, U_2, \ldots
  3. For each interval UiU_i, cluster the preimage f1(Ui)f^{-1}(U_i) using any clustering algorithm
  4. Build a nerve: create a node for each cluster, and an edge between nodes whose clusters share data points (from overlapping intervals)

Output: a graph (or simplicial complex) summarizing the topological structure of the data.

Parameter choices:

  • Filter function: problem-dependent. Density (DBSCAN), Gaussian density, PCA coordinates, graph Laplacian coordinates, L-infinity centrality
  • Number of intervals and overlap percentage (typically 20-50% overlap)
  • Clustering algorithm and its parameters

Applications:

  • Breast cancer subtype identification (Nicolau, Levine, Carlsson, 2011): Mapper revealed a new subgroup with 100% survival
  • Diabetes patient stratification
  • Gene expression analysis
  • NBA player analysis (playing style topology)

Kepler Mapper: standard Python implementation. Interactive visualization.

Vectorization of Persistence Diagrams

Persistence diagrams are not directly usable in ML (they are multisets, not vectors). Vectorization methods:

  • Persistence landscapes (Bubenik, 2015): transform diagrams into functions in a Banach space. Statistical tests are valid. The kk-th landscape function is the kk-th largest value of the tent functions centered at diagram points
  • Persistence images (Adams et al., 2017): weight diagram points by a function, convolve with Gaussian, discretize on a grid. Stable and differentiable
  • Persistence statistics: summarize by total persistence, entropy, polynomial features
  • Kernel methods: persistence weighted Gaussian kernel, sliced Wasserstein kernel, persistence Fisher kernel

Software

Ripser

Ripser (Bauer, 2021): the fastest software for computing Vietoris-Rips persistent homology.

Key algorithmic innovations:

  • Implicit representation: never explicitly stores the full boundary matrix. Computes matrix entries on-the-fly from the distance matrix
  • Apparent pairs: identifies persistence pairs directly from cofacet/facet relationships without reduction (handles the majority of pairs)
  • Clearing optimization: zeros out columns corresponding to identified birth simplices
  • Cohomology: computes persistent cohomology (dual of homology), which is more efficient for sparse data

Handles tens of thousands of points in minutes. Implementations: C++ (original), Ripser.py (Python bindings), Ripser++ (GPU-accelerated), Cubical Ripser (for gridded data).

GUDHI

GUDHI (Geometry Understanding in Higher Dimensions): comprehensive C++/Python library for TDA.

Features:

  • Multiple complex constructions: Rips, Cech, Alpha, witness, tangential, cubical
  • Persistent homology with various coefficient fields
  • Bottleneck and Wasserstein distances
  • Persistence representations (landscapes, images)
  • Nerve and graph-induced complexes
  • Topological inference and geometric tools

Other Tools

  • Dionysus2: C++/Python library by Dmitriy Morozov. Supports zigzag persistence, vineyards (tracking diagrams as data changes)
  • PHAT: optimized boundary matrix reduction (various strategies: standard, twist, chunk, spectral sequence)
  • Giotto-tda: scikit-learn compatible TDA library. Integrates persistence into ML pipelines
  • TDA (R package): R interface to multiple backends (GUDHI, Dionysus, PHAT)
  • Eirene: Julia library for persistent homology
  • scikit-tda: Python ecosystem unifying Ripser, Kepler Mapper, persim (persistence diagram utilities)

Applications

Shape Analysis

Persistent homology captures shape features (holes, tunnels, voids) across scales. Used in:

  • 3D shape retrieval and classification
  • Protein structure analysis (alpha complexes of atomic coordinates)
  • Material science (pore structure characterization)

Time Series and Signals

Takens' embedding: embed a time series x(t)x(t) into a point cloud via delay coordinates (x(t),x(t+τ),,x(t+(d1)τ))(x(t), x(t+\tau), \ldots, x(t+(d-1)\tau)). Persistent homology of this point cloud detects:

  • Periodicity (H1H_1 features with long persistence)
  • Quasiperiodicity (torus structure, H1H_1 with rank 2)
  • Chaos (higher complexity)

Machine Learning Integration

  • Topological features for classification: compute persistence diagrams, vectorize, feed to classifiers (SVM, random forest, neural network)
  • Topological loss functions: penalize or encourage topological features during training. Used in image segmentation to enforce correct topology (connected components, no spurious holes)
  • Topological autoencoders: regularize latent spaces to preserve topological features of the data
  • Persistent homology of neural network loss landscapes: characterize the topology of sublevel sets of the loss function